Abstract



This survey is an introduction to positive definite kernels and the set 
of methods they have inspired in the machine learning literature, namely 
kernel methods. We first discuss some properties of positive definite kernels
as well as reproducing kernel Hilbert spaces, the natural extension of
the set of functions $\{k(x, \cdot), x \in \mathcal{X}\}$ associated with a kernel $k$ defined on
a space X. We discuss at length the construction of kernel functions that 
take advantage of well-known statistical models. We provide an overview 
of numerous data-analysis methods which take advantage of reproducing 
kernel Hilbert spaces and discuss the idea of combining several kernels to 
improve the performance on certain tasks. We also provide a short cook- 
book of different kernels which are particularly useful for certain data- 
types such as images, graphs or speech segments. 

Remark: This report is a draft. Comments and suggestions will be highly 
appreciated. 

Summary 

We provide in this survey a short introduction to positive definite kernels and 
the set of methods they have inspired in machine learning, also known as kernel 
methods. The main idea behind kernel methods is the following. Most
data-inference tasks aim at defining an appropriate decision function $f$ on a set of
objects of interest $\mathcal{X}$. When $\mathcal{X}$ is a vector space of dimension $d$, say $\mathbb{R}^d$, linear
functions $f_a(x) = a^T x$ are one of the easiest and best understood choices,
notably for regression, classification or dimensionality reduction. Given a positive
definite kernel $k$ on $\mathcal{X}$, that is a real-valued function on $\mathcal{X} \times \mathcal{X}$ which quantifies
how similar two points $x$ and $y$ are through the value $k(x, y)$, kernel
methods are algorithms which estimate functions $f$ of the form

$$ f : x \in \mathcal{X} \mapsto f(x) = \sum_{i \in I} \alpha_i k(x_i, x), \qquad (1) $$

where $(x_i)_{i \in I}$ is a family of known points paired with $(\alpha_i)_{i \in I}$, a family of real
coefficients. Kernel methods are often referred to as

• data-driven, since the function $f$ described in Equation (1) is an expansion
of evaluations of the kernel $k$ on points observed in the sample $I$, as
opposed to a linear function $a^T x$ which only has $d$ parameters;

• non-parametric, since the vector of parameters $(\alpha_i)$ is indexed on a set $I$
which is of variable size;

• non-linear, since $k$ can be a non-linear function such as the Gaussian kernel
$k(x, y) = \exp(-\|x - y\|^2 / (2\sigma^2))$, and result in non-linear compounded
functions $f$;

• easily handled through convex programming, since many of the optimization
problems formulated to propose suitable choices for the weights $\alpha$ involve
quadratic constraints and objectives, which typically involve terms of the
sort $\alpha^T K \alpha$ where $K$ is a positive semi-definite matrix of kernel evaluations
$[k(x_i, x_j)]$.
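As a minimal sketch of the expansion in Equation (1), the snippet below evaluates $f(x) = \sum_i \alpha_i k(x_i, x)$ with the Gaussian kernel. The sample points and the weights are hypothetical values picked by hand for illustration; in practice the weights would come from an estimation algorithm.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2)))

def kernel_expansion(x, points, alphas, kernel=gaussian_kernel):
    """Evaluate f(x) = sum_i alpha_i * k(x_i, x) over the sample."""
    return sum(a * kernel(xi, x) for xi, a in zip(points, alphas))

# Hypothetical sample (x_i) and weights (alpha_i), not estimated from data.
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
alphas = np.array([1.0, -0.5, 0.25])

value = kernel_expansion(np.array([0.0, 0.0]), points, alphas)
print(value)
```

The number of terms in the sum grows with the sample, which is the sense in which the function is data-driven and non-parametric.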

The problem of defining all of the elements introduced above, from the kernel
$k$ to the index set $I$ and most importantly the weights $\alpha_i$, has spurred a large
corpus of literature. We propose a survey of such techniques in this document.
Our aim is to provide both theoretical and practical insights on positive definite 
kernels. 

This survey is structured as follows: 

• We start this survey by giving an overall introduction to kernel methods
in Section 1 and highlight their specificity.

• We provide the reader with the theoretical foundations that underlie
positive definite kernels in Section 2, introduce reproducing kernel Hilbert
space theory and provide a discussion on the relationships between
positive definite kernels and distances.

• Section 3 describes different families of kernels which have been covered in
the literature of the last decade. We also describe a few popular techniques
to encode prior knowledge on objects when defining kernels.

• We follow with the exposition in Section 4 of popular methods which,
paired with the definition of a kernel, provide estimation algorithms to
define the weights $\alpha_i$ of Equation (1).

• Selecting the right kernel for a given application is a practical hurdle when
applying kernel methods. We provide a few techniques to do
so in Section 5, notably parameter tuning and the construction of linear
mixtures of kernels, also known as multiple kernel learning.

• We close the survey by providing a brief cookbook of kernels in Section 6,
that is a short description of kernels for complex objects such as strings,
texts, graphs and images.
This survey is built on earlier references, notably (Schölkopf and Smola, 2002;
Schölkopf et al., 2004; Shawe-Taylor and Cristianini, 2004). Whenever adequate
we have tried to enrich this presentation with slightly more theoretical insights
from (Berg et al., 1984; Berlinet and Thomas-Agnan, 2003), notably in
Section 2. Topics covered in this survey overlap with some of the sections
of (Müller et al., 2001) and more recently (Hofmann et al., 2008). The latter
references cover in more detail kernel machines, such as the support vector
machine for binary or multi-class classification. This presentation is comparatively
tilted towards the study of positive definite kernels, notably in Sections 2 and 3.
1 Introduction



The automation of data collection in most human activities, from industries and
public institutions to academia, has generated tremendous amounts of observational
data. At the same time, computational means have expanded in such a
way that massive parallel clusters are now an affordable commodity for most
laboratories and small companies. Unfortunately, recent years have seen a widening
gap between our ability to produce and store these databases
and the analytical tools that are needed to infer knowledge from them.
This long quest to understand and analyze such databases has spurred in the
last decades fertile discoveries at the intersection of mathematics, statistics and
computer science.

One of the most interesting changes brought forward by the abundance of
data in recent years lies arguably in the increasing diversity of data structures
practitioners are now faced with. Some complex data types that come from
real-life applications do not translate well into simple vectors of features, which
used to be a de facto requirement for statistical analysis up to four decades
ago. When the task on such data types can be translated into elementary
subtasks that involve for instance regression, binary or multi-class classification,
dimensionality reduction, canonical correlation analysis or clustering, a novel
class of algorithms popularized in the late nineties and known as kernel methods
has proven to be effective, and often reaches state-of-the-art performance, on many
of these problems.



statistics, functional analysis and computer science: the mathematical
machinery of kernel methods can be traced back to the seminal presentation
of reproducing kernel Hilbert spaces by Aronszajn (1950) and its use
in non-parametric statistics by Parzen (1962). However, their recent popularity
in machine learning comes from recent innovations in both the design of
kernels geared towards specific applications, such as the ones we cover in
Section 6, paired with efficient kernel machines as introduced in Section 4.
Examples of the latter include algorithms such as Gaussian processes with sparse
representations (Csató and Opper, 2002) or the popular support vector machine
(Cortes and Vapnik, 1995). The theoretical justifications for such tools
can be found in the statistical learning literature (Cucker and Smale, 2002;
Vapnik, 1998) but also in subsequent convergence and consistency analysis
carried out for specific techniques (Fukumizu et al., 2007; Vert and Vert, 2005;
Bach, 2008b). Kernel design embodies the research trend pioneered in Jaakkola
and Haussler (1999); Haussler (1999); Watkins (2000) of incorporating contextual
knowledge on the objects of interest to define kernels.

Two features of kernel methods have often been quoted to explain their
practical success. First, kernel methods can efficiently handle
complex data types through the definition of appropriate kernels. Second,
kernel methods can handle data which have multiple representations, namely
multimodal data. Let us review these claims before introducing the mathematical
definition of kernels in the next section.
1.1 Versatile Framework for Structured Data



Structured objects such as (to cite a few) strings, 3D structures, trees and net- 
works, time-series, histograms, images, and texts have become in an increasing 
number of applications the de facto inputs for data analysis algorithms. The 
originality of kernel methods is to address this diversity through a single ap- 
proach. 



from n points to n × n similarity matrices: using kernel methods on a
dataset usually involves choosing first a family of similarity measures between
pairs of objects. Irrespective of the initial complexity of the considered objects,
dealing with a learning problem through kernels is equivalent to translating a set
of $n$ data points into a symmetric and positive definite $n \times n$ similarity matrix.
This matrix will be the sole input used by the kernel algorithm, as schematically
shown in Figure 1. This is very similar to the k-nearest neighbor (k-NN)
framework (see (Hastie et al., 2001, §13) for a survey) where only distances
between points matter to derive decision functions. On the contrary, parametric
approaches used in statistics and neural networks impose a functional class
beforehand (e.g. a family of statistical models or a neural architecture), which
is either tailored to fit vectorial data, which in most cases requires a feature
extraction procedure to avoid large or noisy vectorial representations, or
tailored to fit a particular data type (hidden Markov models with strings, Markov
random fields with images, parametric models for time series with given lags
and seasonal corrections etc.). In this context, practitioners usually credit kernel
methods with several advantages, among them the fact that

• Defining kernel functions is in general easier than designing an accurate
generative model and the estimation machinery that goes along with
it, notably the optimization mechanisms and/or Bayesian computational
schemes that are necessary to make computations tractable.

• Efficient kernel machines, that is algorithms which directly use kernel
matrices as an input, such as the SVM or kernel-PCA, are numerous and
the subject of separate research. Their wide availability in the form
of software packages makes them simple to use once a kernel has been
defined.

• Kernel methods initially share the conceptual simplicity of k-nearest
neighbors, which makes them popular when dealing with high-dimensional and
challenging datasets for which little is known beforehand, such as the
study of long sequences in bioinformatics (Vert, 2006). On the other hand,
kernel algorithms offer a wider scope than the regression/classification
applications of k-NN and also provide a motivated answer to control the
bias/variance tradeoff of the decision function through penalized estimation,
as explained in Section [
Figure 1: Given a dataset in a given space $\mathcal{X}$, represented as $\{x_1, x_2, x_3, x_4\}$ in the figure above,
the kernel approach to data analysis involves representing these points through a positive-definite
symmetric matrix of inter-similarities between points, as in the matrix $K_{4 \times 4}$ in the figure on the
right. Given a new point $x_5$, any prediction with respect to $x_5$ (as in regression or classification for
instance) will be a direct function of the similarity of $x_5$ to the learning set $\{x_1, x_2, x_3, x_4\}$. Thus,
and in practice, kernel methods rely exclusively, both in the training phase and the actual use of
the decision function, on similarity matrices.
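The translation from $n$ objects to an $n \times n$ similarity matrix can be sketched as follows. The string kernel used here (a dot-product of character counts) is a toy choice assumed only for illustration and is not one of the kernels discussed in this survey; `gram_matrix` and `char_count_kernel` are hypothetical helper names.

```python
import numpy as np
from collections import Counter

def char_count_kernel(s, t):
    """Toy kernel on strings: dot-product of character-count vectors."""
    cs, ct = Counter(s), Counter(t)
    return float(sum(cs[ch] * ct[ch] for ch in cs))

def gram_matrix(objects, kernel):
    """Build the n x n matrix of pairwise kernel evaluations."""
    n = len(objects)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = kernel(objects[i], objects[j])
    return K

sample = ["abba", "abc", "cab", "bbb"]   # learning set {x_1, ..., x_4}
K = gram_matrix(sample, char_count_kernel)

# A new object x_5 enters the algorithm only through its similarities
# to the learning set.
k_new = np.array([char_count_kernel(s, "acab") for s in sample])
```

Whatever the nature of the objects, a kernel machine would only see `K` during training and `k_new` at prediction time.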



1.2 Multimodality and Mixtures of Kernels 

In most applications currently studied by practitioners, datasets are increasingly
multimodal, namely objects of interest are described through the lens of different
representations.

For instance, a protein can be seen as an amino-acid sequence, a macromolecule
with a 3D-structure, an expression level in a DNA-chip, a node in a
biological pathway or in a phylogenetic tree. A video segment might be
characterized by its images, its soundtrack, or additional information such as when
it was broadcast and on which channel. The interrelations between these
modalities and the capacity to integrate them is likely to prove helpful for most
learning tasks. Kernel methods provide an elegant way of integrating multiple
modalities through convex kernel combinations. This combination usually takes
place before using a kernel machine, as illustrated in Figure 2. This stands in
stark contrast to other standard techniques which usually aggregate decision
functions trained on the separated modalities. A wide range of techniques have
been designed to do so through convex optimization and the use of unlabelled
data (Lanckriet et al., 2004; Sindhwani et al., 2005). Kernels can thus be seen
as atomic elements that focus on certain types of similarities for the objects,
which can be combined through so-called multiple kernel learning methods as
will be exposed more specifically in Section 5.2.
Figure 2: A dataset of proteins $\{p_1, p_2, p_3, \ldots, p_n\}$ can be regarded in (at least) three different ways: as a dataset
of 3D structures, a dataset of sequences and a set of nodes in a network which interact with each
other. A different kernel matrix can be extracted from each datatype, using known kernels on 3D
shapes, strings and graphs. The resulting kernels can then be combined together with arbitrary
weights, as is the case above where a simple average is considered, or estimated weights, which is
the subject of Section 5.2.
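The simple average mentioned in the caption can be sketched as below. The three Gram matrices are stand-ins built from random features (an assumption made so that the example is self-contained); real ones would come from kernels on 3D shapes, strings and graphs. The helper `combine` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5  # number of proteins in this toy example

def linear_gram(features):
    """Gram matrix of the linear kernel on a feature matrix."""
    return features @ features.T

# Stand-ins for the structure, sequence and network kernel matrices.
kernels = [linear_gram(rng.normal(size=(n, d))) for d in (3, 4, 2)]

def combine(kernels, weights):
    """Non-negative linear combination of kernel matrices."""
    weights = np.asarray(weights, dtype=float)
    if np.any(weights < 0):
        raise ValueError("weights must be non-negative to preserve positive definiteness")
    return sum(w * K for w, K in zip(weights, kernels))

K_avg = combine(kernels, [1 / 3, 1 / 3, 1 / 3])
min_eig = np.linalg.eigvalsh(K_avg).min()  # non-negative up to round-off
```

Because p.d. kernels form a convex cone, any such combination with non-negative weights is again a valid kernel matrix.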
2 Kernels: a Mathematical Definition



2.1 Positive Definiteness 



Let us start this section by providing the reader with a definition for kernels,
since the term "kernel" itself is used in different branches of mathematics, from
linear algebra and density estimation to integral operator theory. Some classical
kernels used in non-parametric statistics, such as the Epanechnikov kernel (footnote 1), are
not, for instance, kernels in the sense of the terminology adopted in this report.
We develop in this section elementary insights on kernels, combining different
presentations given in (Berlinet and Thomas-Agnan, 2003; Berg et al., 1984;
Schölkopf and Smola, 2002) to which the reader may refer for a more complete
exposition.



basic mathematical definition: let $\mathcal{X}$ be a non-empty set sometimes referred
to as the index set, and $k$ a symmetric real-valued (footnote 2) function on $\mathcal{X} \times \mathcal{X}$. For
practitioners of kernel methods, a kernel is above all a positive definite function
in the following sense:

Definition 1 (Real-valued Positive Definite Kernels) A symmetric function
$k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a positive definite (p.d.) kernel on $\mathcal{X}$ if

$$ \sum_{i,j=1}^{n} c_i c_j k(x_i, x_j) \geq 0, \qquad (2) $$

holds for any $n \in \mathbb{N}$, $x_1, \ldots, x_n \in \mathcal{X}$ and $c_1, \ldots, c_n \in \mathbb{R}$.



kernel matrices derived from kernel functions: one can easily deduce
from Definition 1 that the set of p.d. kernels is a closed, convex pointed cone (footnote 3).
Furthermore, the positive definiteness of kernel functions translates in practice
into the positive definiteness of so-called Gram matrices, that is matrices of
kernel evaluations built on a sample of points $X = \{x_i\}_{i \in I}$ in $\mathcal{X}$,

$$ K_X = [k(x_i, x_j)]_{i,j \in I}. $$

Elementary properties of the set of kernel functions, such as its closure under
pointwise and tensor products, are directly inherited from well-known results in
Kronecker and Schur (or Hadamard) algebras of matrices (Bernstein, 2005, §7).
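The quadratic form of Equation (2) and the closure under pointwise products can be checked numerically on a small sample. This is only a sanity check on random points (an assumed toy setting), not a proof; the helper `gaussian_gram` is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))  # six sample points in R^3

def gaussian_gram(X, sigma=1.0):
    """Gram matrix of the Gaussian kernel on the rows of X."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))

K_gauss = gaussian_gram(X)
K_lin = X @ X.T  # Gram matrix of the linear kernel

# Equation (2): sum_ij c_i c_j k(x_i, x_j) for an arbitrary weight vector c.
c = rng.normal(size=6)
quad_form = c @ K_gauss @ c

# Schur (Hadamard) product of two Gram matrices: entrywise multiplication.
K_prod = K_gauss * K_lin
min_eigs = [np.linalg.eigvalsh(K).min() for K in (K_gauss, K_lin, K_prod)]
```

All smallest eigenvalues are non-negative up to round-off, which is the matrix counterpart of the positive definiteness of the kernels involved.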



kernel matrices created using other kernel matrices: kernel matrices for
a sample $X$ can be obtained by applying transformations $r$ that conserve positive
definiteness to a prior Gram matrix $K_X$. In such a case the matrix $r(K_X)$ can
be used directly on that subspace, namely without having to define explicit
formulas for the constructed kernel on the whole space $\mathcal{X} \times \mathcal{X}$. A basic example
is known as the empirical kernel map, where the square map $r : M \mapsto M^2$
can be used on a matrix (Schölkopf et al., 2002). More complex constructions
are the computation of the diffusion kernel on elements of a graph through its
Laplacian matrix (Kondor and Lafferty, 2002), or direct transformations of the
kernel matrix through unlabelled data (Sindhwani et al., 2005).

Footnote 1: for $h > 0$, $k_h(x, y) = \frac{3}{4} \left( 1 - \left( \frac{x - y}{h} \right)^2 \right)^+$.

Footnote 2: kernels are usually complex-valued in the mathematical literature; we only consider the
real case here, which is the common practice in machine learning.

Footnote 3: a set $C$ is a cone if for any $\lambda \geq 0$, $x \in C \Rightarrow \lambda x \in C$; it is pointed if $x \in C, -x \in C \Rightarrow x = 0$.
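Two of the transformations mentioned above can be sketched on small matrices. The prior Gram matrix and the path graph are toy assumptions, and the diffusion kernel is written here as $\exp(-\beta L)$ for a graph Laplacian $L$ with an arbitrary $\beta$, which is one common convention and may differ in detail from the cited reference.

```python
import numpy as np

rng = np.random.default_rng(0)

# Square map r: M -> M^2 applied to a prior Gram matrix.
F = rng.normal(size=(5, 2))
K = F @ F.T
K_squared = K @ K

# Diffusion kernel on a path graph with 4 nodes (0-1-2-3).
A = np.diag(np.ones(3), k=1)
A = A + A.T                           # adjacency matrix
L = np.diag(A.sum(axis=1)) - A        # graph Laplacian
beta = 0.5                            # diffusion parameter (arbitrary)
w, V = np.linalg.eigh(L)              # L = V diag(w) V^T
K_diff = (V * np.exp(-beta * w)) @ V.T  # matrix exponential exp(-beta L)

min_eig_sq = np.linalg.eigvalsh(K_squared).min()
min_eig_diff = np.linalg.eigvalsh(K_diff).min()
```

Both results are symmetric with non-negative spectrum, so they can be passed to a kernel machine without an explicit formula for the underlying kernel function.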



strict and semi-definite positiveness: functions for which the sum in Equation (2)
is (strictly) positive when $c \neq 0$ are sometimes referred to as positive
definite functions, in contrast with functions for which this sum is only
non-negative, which are termed positive semi-definite. We will use for convenience
throughout this report the term positive definite for kernels that simply comply
with non-negativity, and will consider indifferently positive semi-definite and
positive definite functions. Most theoretical results that will be presented in this
report are also indifferent to this distinction, and in numerical practice definiteness
and semi-definiteness will be equivalent since most estimation procedures
consider a regularization of some form on the matrices to explicitly lower bound
their smallest eigenvalue, and hence control their condition number (footnote 4).
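The effect of such a regularization can be illustrated on a rank-deficient Gram matrix. Adding a multiple of the identity is assumed here as the regularization, which is one common choice among others; the value of `lam` is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Five points with a linear kernel in R^2: the Gram matrix has rank 2,
# so it is positive semi-definite but not strictly positive definite.
X = rng.normal(size=(5, 2))
K = X @ X.T

lam = 0.1                          # regularization strength (arbitrary)
K_reg = K + lam * np.eye(len(K))   # shifts every eigenvalue by lam

eigs = np.linalg.eigvalsh(K)
eigs_reg = np.linalg.eigvalsh(K_reg)
cond_reg = eigs_reg.max() / eigs_reg.min()  # ratio of extreme eigenvalues
```

After the shift the smallest eigenvalue is at least `lam`, so the ratio of the largest to the smallest eigenvalue is finite.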



the importance of positive definiteness: Equation (2) distinguishes general
measures of similarity between objects from a kernel function. The requirement
of Equation (2) is important when seen from (at least) two perspectives.
First, the usage of positive definite matrices is a key assumption in convex
programming (Boyd and Vandenberghe, 2004). In practice the positive definiteness
of kernel matrices ensures that kernel algorithms such as Gaussian processes or
support vector machines converge to a relevant solution (footnote 5). Second, the positive
definiteness assumption is also a key assumption of the functional view described
below in reproducing kernel Hilbert space theory.



2.2 Reproducing Kernels 

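Kernels can also be viewed from the functional analysis viewpoint, since to each
kernel $k$ on $\mathcal{X}$ is associated a Hilbert space $\mathcal{H}_k$ of real-valued functions on $\mathcal{X}$.

Definition 2 (Reproducing Kernel) A real-valued function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$
is a reproducing kernel of a Hilbert space $\mathcal{H}$ of real-valued functions on $\mathcal{X}$ if and
only if

i) $\forall t \in \mathcal{X}, \; k(\cdot, t) \in \mathcal{H}$;
ii) $\forall t \in \mathcal{X}, \forall f \in \mathcal{H}, \; \langle f, k(\cdot, t) \rangle = f(t)$.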

Condition (ii) above is called the reproducing property. A Hilbert space that is
endowed with such a kernel is called a reproducing kernel Hilbert space (rkHs)
or a proper Hilbert space. Conversely, a function on $\mathcal{X} \times \mathcal{X}$ for which such
a Hilbert space $\mathcal{H}$ exists is a reproducing kernel and we usually write $\mathcal{H}_k$ for
this space, which is unique. It turns out that Definitions 1 and 2 are
equivalent, a result known as the Moore-Aronszajn theorem (Aronszajn, 1950).
First, a reproducing kernel is p.d., since it suffices to write the expansion of
Equation (2) to obtain the squared norm of the function $\sum_{i=1}^{n} c_i k(x_i, \cdot)$, that is

$$ \sum_{i,j=1}^{n} c_i c_j k(x_i, x_j) = \left\| \sum_{i=1}^{n} c_i k(x_i, \cdot) \right\|_{\mathcal{H}}^{2}, \qquad (3) $$

which is non-negative. To prove the opposite in a general setting, that is not limited
to the case where $\mathcal{X}$ is compact, which is the starting hypothesis of the Mercer
representation theorem (Mercer, 1909) reported in (Schölkopf and Smola,
2002, p. 37), we refer the reader to the progressive construction of the rkHs
associated with a kernel $k$ and its index set $\mathcal{X}$ presented in (Berlinet and Thomas-Agnan,
2003, §1.3). In practice, the rkHs boils down to the completed linear span of
elementary functions indexed by $\mathcal{X}$, that is

$$ \mathcal{H}_k = \overline{\mathrm{span}} \{ k(x, \cdot), \; x \in \mathcal{X} \}, $$

where by completeness we mean that all Cauchy sequences of functions converge.

Footnote 4: that is the ratio of the biggest to the smallest eigenvalue of a matrix.

Footnote 5: (Haasdonk, 2005; Luss and d'Aspremont, 2008) show however that arbitrary similarity
measures can be used with slightly modified kernel algorithms.



the parallel between a kernel and a rkHs: Definition 2 may seem theoretical
at first glance, but its consequences are very practical. Defining
a positive definite kernel $k$ on any set $\mathcal{X}$ suffices to inherit a Hilbert space of
functions $\mathcal{H}_k$ which may be used to pick candidate functions for a given
data-analysis task. By selecting a kernel $k$, we hope that the space $\mathcal{H}_k$, though
made up of linear combinations of elementary functions, may contain useful
functions with low norm. This is in many ways equivalent to defining a space of
low degree polynomials and its dot-product in order to approximate an arbitrary
function of interest on a given interval $[a, b]$ on the real line with a polynomial
of low norm.



functional norm in a rkHs: another crucial aspect of rkHs is the simplicity
of their induced norms and dot-products, which are both inherited from the
reproducing kernel. The fact that this norm is easy to compute for finite
expansions, as seen in Equation (3), is an important property which has direct
implications when considering regularized estimation schemes, introduced in
Section 4.3 and more precisely Equation (15). Additionally, the dot-product
between two functions in the rkHs can be expressed as

$$ \left\langle \sum_{i \in I} a_i k(x_i, \cdot), \; \sum_{j \in J} b_j k(y_j, \cdot) \right\rangle = \sum_{i \in I, j \in J} a_i b_j k(x_i, y_j), $$

which only depends on kernel evaluations on pairs $(x_i, y_j)$ and on the weights $a_i$
and $b_j$. The fact that in $\mathcal{H}_k$ the dot-product $\langle k(x, \cdot), k(y, \cdot) \rangle$ is equal to $k(x, y)$
illustrates an alternative view, namely that a kernel is a disguised dot-product.
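The norm and dot-product formulas for finite expansions reduce to matrix products, as the following sketch shows for the Gaussian kernel. The points and weights are random toy values, and `gaussian_cross_gram` is a hypothetical helper name.

```python
import numpy as np

def gaussian_cross_gram(X, Y, sigma=1.0):
    """Matrix [k(x_i, y_j)] for the Gaussian kernel on rows of X and Y."""
    d2 = (np.sum(X ** 2, axis=1)[:, None]
          + np.sum(Y ** 2, axis=1)[None, :]
          - 2.0 * X @ Y.T)
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
X, a = rng.normal(size=(4, 2)), rng.normal(size=4)  # f = sum_i a_i k(x_i, .)
Y, b = rng.normal(size=(3, 2)), rng.normal(size=3)  # g = sum_j b_j k(y_j, .)

inner_fg = a @ gaussian_cross_gram(X, Y) @ b   # <f, g> in the rkHs
norm_f_sq = a @ gaussian_cross_gram(X, X) @ a  # ||f||^2, as in Equation (3)
norm_g_sq = b @ gaussian_cross_gram(Y, Y) @ b  # ||g||^2
```

No function is ever evaluated outside the sample: kernel evaluations on pairs of points and the weights are enough, and the Cauchy-Schwarz inequality can be verified on the three numbers obtained.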
2.3 Kernels as Feature Maps



The theorem below (Berlinet and Thomas-Agnan, 2003, p. 22) gives an interpretation
of kernel functions, seen as dot-products between feature representations
of their arguments in a space of sequences.

Theorem 1 A function $k$ on $\mathcal{X} \times \mathcal{X}$ is a positive definite kernel if and only if
there exists a set $T$ and a mapping $\phi$ from $\mathcal{X}$ to $\ell^2(T)$, the set of real sequences
$\{u_t, t \in T\}$ such that $\sum_{t \in T} |u_t|^2 < \infty$, where

$$ \forall (x, y) \in \mathcal{X} \times \mathcal{X}, \quad k(x, y) = \sum_{t \in T} \phi(x)_t \, \phi(y)_t = \langle \phi(x), \phi(y) \rangle_{\ell^2(T)}. $$

The proof is derived from the fact that for any Hilbert space (notably $\mathcal{H}_k$) there
exists a space $\ell^2(T)$ to which it is isometric. As can be glimpsed from this
sketch, the feature map viewpoint and the rkHs one are somehow redundant,
since

$$ x \mapsto k(x, \cdot), $$

is a feature map by itself. If the rkHs is of finite dimension, functions in the
rkHs are exactly the dual space of the Euclidean space of feature projections.
Although closely connected, it is the feature map viewpoint rather than the rkHs
one which actually spurred most of the initial advocacy for kernel methods in
machine learning, notably the SVM as presented in (Cortes and Vapnik, 1995;
Schölkopf and Smola, 2002). The latter references present kernel machines
as mapping data-entries into high-dimensional feature spaces,

$$ \{x_1, \ldots, x_n\} \mapsto \{\phi(x_1), \ldots, \phi(x_n)\}, $$

to find a linear decision surface to separate the points in two distinct classes
of interest. This interpretation actually coincided with the practical choice of
using polynomial kernels (footnote 6) on vectors, for which the feature space is of finite
dimension and well understood as products of monomials up to degree $d$.
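For the polynomial kernel of degree 2 on $\mathbb{R}^2$, the feature map can be written out explicitly and compared with the kernel value. The particular ordering and scaling of the monomials in `phi_degree2` is one valid choice assumed for this sketch; other equivalent feature maps exist.

```python
import numpy as np

def poly_kernel(x, y, b=1.0, d=2):
    """Polynomial kernel k(x, y) = (<x, y> + b)^d."""
    return float((np.dot(x, y) + b) ** d)

def phi_degree2(x, b=1.0):
    """Explicit feature map of the degree-2 polynomial kernel on R^2:
    monomials of degree at most 2, scaled so that dot-products match."""
    x1, x2 = x
    return np.array([
        x1 ** 2,
        x2 ** 2,
        np.sqrt(2.0) * x1 * x2,
        np.sqrt(2.0 * b) * x1,
        np.sqrt(2.0 * b) * x2,
        b,
    ])

x = np.array([1.0, 2.0])
y = np.array([-0.5, 3.0])

k_direct = poly_kernel(x, y)                          # kernel evaluation
k_features = float(phi_degree2(x) @ phi_degree2(y))   # dot-product of features
```

Both quantities agree, while the kernel evaluation never forms the six-dimensional feature vectors.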

The feature map approach was progressively considered to be restrictive in
the literature, since it requires considering the extracted features first and then
computing the kernel that matches them. Furthermore, useful kernels obtained
directly from a similarity between objects do not always translate into feature
maps which can be easily described, as in diffusion kernels on graphs for
instance (Kondor and Lafferty, 2002). Kernels without explicit feature maps may
also be obtained through the polynomial combination of several kernels. The
feature map formulation, particularly advocated in the early days of SVMs, also
misled some observers into thinking that the kernel mapping was but a piece
of the SVM machinery. Instead, the SVM should rather be seen as an efficient
computational approach, among many others, deployed to select a "good"
function $f$ in the rkHs $\mathcal{H}_k$ given a learning sample, as presented in Section 4.3.

Footnote 6: $k(x, y) = (\langle x, y \rangle + b)^d$, $d \in \mathbb{N}$, $b \in \mathbb{R}_+$.
2.4 Kernels and Distances, a Discussion



We discuss in this section possible parallels between positive definite kernels and 
distances. Kernel methods are often compared to distance based methods such 
as nearest neighbors. We would like to point out a few differences between their 
two respective ingredients, kernels k and distances d. 

Definition 3 (Distances) Given a space $\mathcal{X}$, a nonnegative-valued function $d$
on $\mathcal{X} \times \mathcal{X}$ is a distance if it satisfies the following axioms, valid for all elements
$x$, $y$ and $z$ of $\mathcal{X}$:

• $d(x, y) \geq 0$, and $d(x, y) = 0$ if and only if $x = y$,

• $d(x, y) = d(y, x)$ (symmetry),

• $d(x, z) \leq d(x, y) + d(y, z)$ (triangle inequality).

the missing link between kernels and distances is given by a particular type
of kernel function, which includes all negations of positive definite kernels as a
particular case,

Definition 4 (Negative Definite Kernels) A symmetric function $\psi : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$
is a negative definite (n.d.) kernel on $\mathcal{X}$ if

$$ \sum_{i,j=1}^{n} c_i c_j \psi(x_i, x_j) \leq 0 \qquad (4) $$

holds for any $n \in \mathbb{N}$, $x_1, \ldots, x_n \in \mathcal{X}$ and $c_1, \ldots, c_n \in \mathbb{R}$ such that $\sum_{i=1}^{n} c_i = 0$.

A matricial interpretation of this is that for any set of points $x_1, \ldots, x_n$ and
vector of weights $c \in \mathbb{R}^n$ in the hyperplane $\{y \mid \mathbf{1}^T y = 0\}$, we necessarily have
that $c^T \Psi c \leq 0$ with $\Psi = [\psi(x_i, x_j)]_{i,j}$. A particular family of distances known
as Hilbertian norms can be considered as negative definite kernels, as pointed
out by Hein and Bousquet. This link is made explicit by (Berg et al.,
1984, Proposition 3.2) given below.



Proposition 2 Let $\mathcal{X}$ be a nonempty set and $\psi : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a negative definite
kernel. Then there is a Hilbert space $H$ and a mapping $x \mapsto \phi(x)$ from $\mathcal{X}$ to $H$
such that

$$ \psi(x, y) = \|\phi(x) - \phi(y)\|^2 + f(x) + f(y), \qquad (5) $$

where $f : \mathcal{X} \to \mathbb{R}$ is a real-valued function on $\mathcal{X}$. If $\psi(x, x) = 0$ for all
$x \in \mathcal{X}$ then $f$ can be chosen as zero. If the set of pairs such that $\psi(x, y) = 0$ is
exactly $\{(x, x), x \in \mathcal{X}\}$ then $\sqrt{\psi}$ is a distance.
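The simplest instance of Proposition 2 is the squared Euclidean distance, with $\phi$ the identity and $f = 0$. The sketch below checks Equation (4) and the triangle inequality for its square root on random points; this is a toy verification under those assumptions, not a proof.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
n = len(X)

# psi(x, y) = ||x - y||^2, a negative definite kernel vanishing on the diagonal.
sq = np.sum(X ** 2, axis=1)
Psi = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)

# Weights constrained to the hyperplane sum_i c_i = 0.
c = rng.normal(size=n)
c -= c.mean()
quad_form = c @ Psi @ c  # Equation (4): should be non-positive

# The square root of psi should satisfy the triangle inequality.
D = np.sqrt(Psi)
triangle_ok = all(
    D[i, k] <= D[i, j] + D[j, k] + 1e-12
    for i in range(n) for j in range(n) for k in range(n)
)
```

Without the centering step on `c` the quadratic form can be positive, which is why the constraint is part of Definition 4.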



negative definite kernels and distances: the parallel between negative definite
kernels and distances is thus clear: whenever a n.d. kernel vanishes on the
diagonal, that is the set $\{(x, x), x \in \mathcal{X}\}$, and is zero only on the diagonal, then its
square root is a distance for $\mathcal{X}$. More generally, to each negative definite kernel
corresponds a decomposition (5) which can be exploited to recover a distance
given that the function $f$ can be deduced from $\psi$, typically as $f(x) = \frac{\psi(x, x)}{2}$. On the
other hand, a distance does not necessarily correspond to a negative definite
kernel, and there are numerous examples of distances which are not Hilbertian
metrics, such as the Monge-Kantorovich distance (Naor and Schechtman, 2007)
or most variations of the edit distance (Vert et al., 2004).

negative definite kernels and positive definite kernels: on the other
hand, n.d. kernels can be identified with a subfamily of p.d. kernels known as
infinitely divisible kernels. A nonnegative-valued kernel $k$ is said to be infinitely
divisible if for every $n \in \mathbb{N}$ there exists a positive definite kernel $k_n$ such that
$k = (k_n)^n$.

Example 1 A simple example is the usual Gaussian kernel between two vectors
of $\mathbb{R}^d$ since rewriting it as

$$ k_\sigma(x, y) = \left[ e^{-\frac{\|x - y\|^2}{2 n \sigma^2}} \right]^n $$

suffices to prove this property.



Here follows a slightly simplified version of (Berg et al., 1984, Proposition 2.7)
which provides a key interpretation:

Proposition 3 For a p.d. kernel $k \geq 0$ on $\mathcal{X} \times \mathcal{X}$, the following conditions are
equivalent

(i) $k$ is infinitely divisible,

(ii) $-\log k \in \mathcal{N}(\mathcal{X})$,

(iii) $k^t$ is positive definite for all $t > 0$.
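Conditions (ii) and (iii) can be checked numerically for the Gaussian kernel on a small random sample. The exponents tested and the sample itself are arbitrary choices made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))
sigma = 1.0

sq = np.sum(X ** 2, axis=1)
d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
K = np.exp(-d2 / (2.0 * sigma ** 2))  # Gaussian Gram matrix

# (iii): entrywise powers K^t stay positive semi-definite for t > 0.
min_eigs = {t: np.linalg.eigvalsh(K ** t).min() for t in (0.1, 0.5, 2.0)}

# (ii): -log k is negative definite, tested on weights summing to zero.
c = rng.normal(size=len(X))
c -= c.mean()
nd_form = c @ (-np.log(K)) @ c
```

Here `K ** t` is the entrywise power, which corresponds to a Gaussian kernel of bandwidth $\sigma / \sqrt{t}$, and `-np.log(K)` is the squared Euclidean distance divided by $2\sigma^2$.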

Figure 3 provides a schematic view on the relationships between distances,
negative definite kernels and positive definite kernels. The reader should also
keep in mind that

(i) $\mathcal{D}(\mathcal{X})$ is a cone;

(ii) $\mathcal{N}(\mathcal{X})$ is a cone. Additionally, if $\psi(x, x) \geq 0$ for all $x \in \mathcal{X}$, $\psi^\alpha$ and
$\log(1 + \psi)$ are also in $\mathcal{N}(\mathcal{X})$ for $0 < \alpha < 1$ following (Berg et al., 1984,
Corollary 2.10);

(iii) $\mathcal{P}(\mathcal{X})$ is a cone. Additionally, $k^n$ is also in $\mathcal{P}(\mathcal{X})$ for any p.d. kernel $k$ and $n \in \mathbb{N}$, as well as the
pointwise product $k_1 k_2$ of two p.d. kernels;

and may refer to (Berg et al., 1984, p. 79) for other numerous results and examples.
Figure 3: A schematic view of the relationships between the set of distances on $\mathcal{X}$, written as
$\mathcal{D}(\mathcal{X})$ on the left, its subset of Hilbertian metrics and their one-to-one mapping with a certain
family of negative-definite kernels vanishing on the diagonal, itself contained in the set $\mathcal{N}(\mathcal{X})$ of
more general negative definite kernels on $\mathcal{X}$. Note that the set of negative-definite kernels is in
direct correspondence with the set $\mathcal{P}_\infty(\mathcal{X})$, the subset of infinitely divisible positive definite kernels
of $\mathcal{P}(\mathcal{X})$.



3 Designing Kernels from Statistical Knowledge 



We follow the relatively theoretical exposition of the previous chapter with a 
more practical exposition. Although the mathematical elements presented above 
explain most of the desirable properties of kernel machines, notably the convex- 
ity of most optimization carried out when estimating kernel machines, the view 
taken by practitioners on kernels is most often linked with that of a similarity
measure between objects, namely that for two objects $x$ and $y$ the value $k(x, y)$
can be a reliable quantification of how similar $x$ and $y$ are. This similarity may
be chosen arbitrarily, incorporating as much prior knowledge on the objects as 
possible without any connection to the task itself, or rather considered under 
the light of a given task. 

Example 2 The usual criterion for two texts to be similar might be that they 
share the same languages and/or topics of interest and/or overall length, but a 
very specialized algorithm might solely focus on the occurrence of a single word 
within their body. Two photographs might be qualified as similar if they display 
similar colors or shapes. For other tasks, rather than the image itself their date 
or their location might be the key criterion, regardless of their pictorial content. 
Two videos might be qualified as similar if they display the same person for
a fixed length of time or if they were broadcast through the same television
channel.

The following sections start with the relatively simple example of defining ker- 
nels on vectors. We address objects with more complex structures later in this 
chapter using statistical modeling. 



3.1 Classical Kernels for Vectors 
3.1.1 Vectors in $\mathbb{R}^n$

Finite-dimensional vectors are a fundamental tool to represent natural
phenomena as numeric data. Vectors are easy to manipulate by both
algorithms and computer code, and as such positive definite kernels taking
vector arguments can be easily constructed. The canonical dot-product on a
vector space of finite dimension, also known as the linear kernel
$k(x, y) = x \cdot y$, is the most fundamental example. We will use three lemmas
to show the reader how most classical kernels can be easily reconstructed
through the linear kernel. For a family of kernels $k_1, \dots, k_n, \dots$

• The sum $\sum_{i=1}^n \lambda_i k_i$ is positive definite, given $\lambda_1, \dots, \lambda_n \geq 0$.

• The product $k_1^{a_1} \cdots k_n^{a_n}$ is positive definite, given $a_1, \dots, a_n \in \mathbb{N}$.

• The limit $k \overset{\text{def}}{=} \lim_{n \to \infty} k_n$ is positive definite if the limit exists.



Using these properties, listed in (Berg et al., 1984), we can reconstruct

• the polynomial kernel $k_p(x, y) = (x \cdot y + b)^d$, $b > 0$, $d \in \mathbb{N}$, simply because
the constant $b > 0$ is a p.d. kernel, and so is $(x \cdot y + b)$ as a consequence of the first
property, and so is $(x \cdot y + b)^d$ as a consequence of the second.



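• the Gaussian kernel $k_\sigma(x, y) = e^{-\frac{\|x - y\|^2}{2\sigma^2}}$, which can be rewritten in the
following form

$$k_\sigma(x, y) = \left[ e^{-\frac{\|x\|^2}{2\sigma^2}}\, e^{-\frac{\|y\|^2}{2\sigma^2}} \right] \cdot \left[ \sum_{i=0}^{\infty} \frac{(x \cdot y)^i}{\sigma^{2i}\, i!} \right].$$

The term in the first brackets is trivially a kernel, and so is the term in
the second part as a limit of positive definite kernels.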
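The two reconstructions above can be checked numerically. The sketch below is illustrative only: the function names, the random sample and the parameter values are assumptions, not part of the original text. It builds the polynomial and Gaussian Gram matrices from the linear kernel alone and inspects their smallest eigenvalue.

```python
import numpy as np

def linear_gram(X):
    """Gram matrix of the linear kernel k(x, y) = x . y."""
    return X @ X.T

def polynomial_gram(X, b=1.0, d=3):
    """(x . y + b)^d: add a constant kernel, then take an elementwise power."""
    return (linear_gram(X) + b) ** d

def gaussian_gram(X, sigma=1.0):
    """exp(-||x - y||^2 / (2 sigma^2)), assembled from the linear kernel only."""
    G = linear_gram(X)
    # first bracket: f(x) f(y) with f(x) = exp(-||x||^2 / (2 sigma^2))
    f = np.exp(-np.diag(G) / (2 * sigma**2))
    # second bracket: exp(x . y / sigma^2), the limit of the power series
    return np.outer(f, f) * np.exp(G / sigma**2)

def min_eigenvalue(K):
    """Smallest eigenvalue of the symmetrized matrix K."""
    return float(np.linalg.eigvalsh((K + K.T) / 2).min())

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))        # hypothetical sample of 20 vectors in R^3
K_poly = polynomial_gram(X)
K_gauss = gaussian_gram(X)
print(min_eigenvalue(K_poly), min_eigenvalue(K_gauss))
```

Both smallest eigenvalues should be nonnegative up to rounding error, which is what positive definiteness of the two kernels requires on any finite sample.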



3.1.2 Vectors in $\mathbb{R}^n_+$ and Histograms

Histograms are frequently encountered in applications of machine learning to
real-life problems. Indeed, most natural phenomena produce observable data, which
the practitioner is likely to count to describe reality. As a consequence, many
observations are available in the form of nonnegative vectors of counts,
which, if normalized, yield histograms of frequencies. Metrics and divergences for
general probability measures, the obvious generalization of histograms, are the
object of study of information geometry (Amari and Nagaoka, 2001). However,
as hinted in Section 2.4, a proper understanding of metrics and divergences for
a certain class of objects cannot be immediately applied to define positive
definite kernels. Indeed, the Kullback-Leibler divergence, which has a fundamental
importance in information geometry, cannot be used as such in kernel methods
as it is neither symmetric nor positive/negative definite.



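elementary kernels on positive measures: it has been shown however
in Hein and Bousquet (2005) that the following family of squared metrics,
respectively the Jensen divergence, the $\chi$-square, total variation and two
variations of the Hellinger distance, are all negative-definite kernels:

$$\psi_{JD}(\theta, \theta') = h\left(\frac{\theta + \theta'}{2}\right) - \frac{h(\theta) + h(\theta')}{2},$$

$$\psi_{\chi^2}(\theta, \theta') = \sum_i \frac{(\theta_i - \theta'_i)^2}{\theta_i + \theta'_i}, \qquad \psi_{TV}(\theta, \theta') = \sum_i |\theta_i - \theta'_i|,$$

$$\psi_{H_2}(\theta, \theta') = \sum_i \left|\sqrt{\theta_i} - \sqrt{\theta'_i}\right|^2, \qquad \psi_{H_1}(\theta, \theta') = \sum_i \left|\sqrt{\theta_i} - \sqrt{\theta'_i}\right|,$$

where $h$ denotes the entropy of a histogram.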



As a consequence, these metrics can all be used to define positive definite
kernels using Proposition 3 and the following formula:

$$k(\theta, \theta') = e^{-\frac{1}{t} \psi(\theta, \theta')},$$

with $t > 0$. Although histograms appear frequently in the study of objects
such as images, through histograms of colors, and texts, through bags-of-words
representations, their usage alone is restrictive when studying objects that carry
a finer structure. In such a case, probability distributions that are tailored
to better capture the interdependencies between smaller components in those
objects can be used to define kernels as presented in the next section.
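As a rough illustration of kernels on histograms built from negative-definite distances, the sketch below assumes the usual forms of the Jensen divergence, $\chi$-square, total variation and Hellinger distances, and exponentiates each of them. All function names and the Dirichlet-sampled histograms are hypothetical.

```python
import numpy as np

def entropy(p):
    """Shannon entropy h(p) = -sum_i p_i ln p_i, with the convention 0 ln 0 = 0."""
    nz = p > 0
    return float(-np.sum(p[nz] * np.log(p[nz])))

def psi_jd(p, q):
    """Jensen divergence: entropy of the average minus average of the entropies."""
    return entropy((p + q) / 2) - (entropy(p) + entropy(q)) / 2

def psi_chi2(p, q):
    s = p + q
    nz = s > 0
    return float(np.sum((p[nz] - q[nz]) ** 2 / s[nz]))

def psi_tv(p, q):
    return float(np.sum(np.abs(p - q)))

def psi_h2(p, q):
    return float(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def psi_h1(p, q):
    return float(np.sum(np.abs(np.sqrt(p) - np.sqrt(q))))

def gram(hists, psi, t=1.0):
    """Gram matrix of k(theta, theta') = exp(-psi(theta, theta') / t)."""
    n = len(hists)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = np.exp(-psi(hists[i], hists[j]) / t)
    return K

rng = np.random.default_rng(0)
hists = rng.dirichlet(np.ones(5), size=15)   # 15 hypothetical histograms with 5 bins
distances = {"JD": psi_jd, "chi2": psi_chi2, "TV": psi_tv, "H2": psi_h2, "H1": psi_h1}
grams = {name: gram(hists, psi) for name, psi in distances.items()}
for name, K in grams.items():
    print(name, float(np.linalg.eigvalsh(K).min()))
```

Each printed value is the smallest eigenvalue of the corresponding Gram matrix, which should be nonnegative up to rounding if the exponentiated distance is positive definite.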



3.2 Statistical Modeling and Kernels 

Fisher kernel: Jaakkola and Haussler (1999) first thought of using generative
models to build kernels that would provide in turn the necessary inputs of
discriminative machines, that is kernel classifiers. Although the principle outlined
in the next lines can be applied to different pairs of datatypes/generative models,
we follow the original presentation of their paper which focused on sequences.
Jaakkola and Haussler (1999) observed that the hidden Markov model (HMM),
which is known to capture efficiently the behaviour of amino-acid sequences, can
be used as an efficient feature extractor. The authors did so by defining for each
considered sequence a vector of features derived from an estimated HMM model,
namely the Fisher score. Given a measurable space $(\mathcal{X}, \mathcal{B}, \nu)$ and a parametric
family of absolutely continuous measures of $\mathcal{X}$ represented by their densities
$\{p_\theta, \theta \in \Theta \subset \mathbb{R}^d\}$, the Fisher kernel between two elements $x, y$ of $\mathcal{X}$ is



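$$k_{\hat\theta}(x, y) = \left( \left.\frac{\partial \ln p_\theta(x)}{\partial \theta}\right|_{\hat\theta} \right)^T J_{\hat\theta}^{-1} \left( \left.\frac{\partial \ln p_\theta(y)}{\partial \theta}\right|_{\hat\theta} \right),$$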



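where $\hat\theta$ is a parameter selected beforehand to match the whole training set, and
$J_{\hat\theta}$ is the Fisher information matrix computed in $\hat\theta$. The statistical model not
only acts as a feature extractor through the score vectors, but also defines the
Mahalanobis metric associated with these vectors through $J_{\hat\theta}$. We introduce the
following alternative formulation of the kernel quoted in Jaakkola et al. (1999),
using the $\nabla_{\hat\theta}$ notation which stands for the gradient of a function computed at
$\hat\theta$, and where $\sigma > 0$ is a bandwidth parameter:

$$k_{\hat\theta}(x, y) = e^{-\frac{1}{\sigma^2} \left(\nabla_{\hat\theta} \ln p_\theta(x) - \nabla_{\hat\theta} \ln p_\theta(y)\right)^T J_{\hat\theta}^{-1} \left(\nabla_{\hat\theta} \ln p_\theta(x) - \nabla_{\hat\theta} \ln p_\theta(y)\right)} \qquad (6)$$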
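To make the Fisher kernel concrete, here is a minimal sketch for a deliberately simple generative model, a univariate Gaussian with parameters (mean, variance), rather than the HMM used by the authors. The model choice, function names and bandwidth `sigma` are assumptions made for illustration only.

```python
import numpy as np

def fisher_score(x, mu, var):
    """Gradient of ln N(x; mu, var) with respect to (mu, var), one row per point."""
    x = np.asarray(x, dtype=float)
    d_mu = (x - mu) / var
    d_var = -0.5 / var + 0.5 * (x - mu) ** 2 / var**2
    return np.stack([d_mu, d_var], axis=1)

def fisher_information(var):
    """Fisher information of N(mu, var) in the (mu, var) parametrization."""
    return np.diag([1.0 / var, 0.5 / var**2])

def fisher_kernel(x, y, mu, var):
    """Inner product of score vectors in the metric J^{-1}."""
    J_inv = np.linalg.inv(fisher_information(var))
    return fisher_score(x, mu, var) @ J_inv @ fisher_score(y, mu, var).T

def exponentiated_fisher_kernel(x, y, mu, var, sigma=1.0):
    """exp(-(1 / sigma^2) * squared Mahalanobis distance between score vectors)."""
    sx, sy = fisher_score(x, mu, var), fisher_score(y, mu, var)
    J_inv = np.linalg.inv(fisher_information(var))
    diff = sx[:, None, :] - sy[None, :, :]
    d2 = np.einsum("ijk,kl,ijl->ij", diff, J_inv, diff)
    return np.exp(-d2 / sigma**2)

rng = np.random.default_rng(0)
sample = rng.normal(loc=2.0, scale=1.5, size=50)
mu_hat, var_hat = sample.mean(), sample.var()    # maximum likelihood estimates
K = fisher_kernel(sample, sample, mu_hat, var_hat)
K_exp = exponentiated_fisher_kernel(sample, sample, mu_hat, var_hat)
```

Here the parameter is fitted once on the whole sample, as in the text, and the same estimate is used for every score vector.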



extensions to the Fisher kernel: the proposal of the Fisher kernel fostered
further research, notably in (Tsuda et al., 2002a; Smith and Gales, 2002).
The motivation behind these contributions was to overcome the limiting
assumption that the parameter $\hat\theta$ on which the score vectors are evaluated is
unique and fits the whole set of points at hand. Rather, Tsuda et al. (2002a)
and Smith and Gales (2002) proposed simultaneously to incorporate, in the
context of binary classification, two parameters $\hat\theta_1$ and $\hat\theta_2$ for each class respectively,
and consider the score vector of the likelihood ratio between the two classes
evaluated in $x$,

$$\phi_{\hat\vartheta} : x \mapsto \left. \frac{\partial \ln \frac{p_{\theta_1}(x)}{p_{\theta_2}(x)}}{\partial \vartheta} \right|_{\vartheta = (\hat\theta_1, \hat\theta_2)},$$

where $\vartheta = (\theta_1, \theta_2)$ is in $\Theta^2$, to propose instead the kernel

$$(x, y) \mapsto \phi_{\hat\vartheta}(x)^T \phi_{\hat\vartheta}(y).$$

The Fisher kernel was also studied from a theoretical perspective when used in
conjunction with a logistic regression (Tsuda et al., 2004).

mutual information kernels: the Fisher kernel is related to a wider class of
kernels coined mutual information kernels by Seeger (2002). Starting
also from a set of distributions $\{p_\theta, \theta \in \Theta\}$ where $\Theta$ is measurable, and from a
given prior $\omega \in L^2(\Theta)$, the mutual information kernel $k_\omega$ between two elements
$x$ and $y$ is defined as

$$k_\omega(x, y) = \int_\Theta p_\theta(x)\, p_\theta(y)\, \omega(d\theta). \qquad (7)$$

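As noted in (Seeger, 2002), the Fisher kernel can be regarded as a maximum a
posteriori approximation of the mutual information kernel, by setting the prior
$\omega$ to the multivariate Gaussian density $\mathcal{N}(\hat\theta, J_{\hat\theta}^{-1})$, following the approximation
of Laplace's method. Let us review this claim in more detail: given an object
$x$ and a parameter $\hat\theta$, the first-order approximation

$$\log p_\theta(x) \approx \log p_{\hat\theta}(x) + \nabla_{\hat\theta} \ln p_\theta(x)^T (\theta - \hat\theta)$$

can be rewritten using the notation

$$\Phi(x) = \nabla_{\hat\theta} \ln p_\theta(x) = \left.\frac{\partial \ln p_\theta(x)}{\partial \theta}\right|_{\hat\theta}$$

as

$$\log p_\theta(x) \approx \log p_{\hat\theta}(x) + \Phi(x)^T (\theta - \hat\theta).$$

Using a Gaussian approximation for $\omega$ yields a change in Equation (7) as

$$\begin{aligned}
k(x, y) &= C \int_\Theta e^{\log p_{\hat\theta}(x) + \Phi(x)^T (\theta - \hat\theta) + \log p_{\hat\theta}(y) + \Phi(y)^T (\theta - \hat\theta)}\, e^{-\frac{1}{2} (\theta - \hat\theta)^T J_{\hat\theta} (\theta - \hat\theta)}\, d\theta \\
&= C\, p_{\hat\theta}(x)\, p_{\hat\theta}(y) \int_\Theta e^{(\Phi(x) + \Phi(y))^T (\theta - \hat\theta) - \frac{1}{2} (\theta - \hat\theta)^T J_{\hat\theta} (\theta - \hat\theta)}\, d\theta,
\end{aligned} \qquad (8)$$

where $C$ is the normalization constant of the Gaussian density. Completing the
square in the exponent shows that the integral is proportional to
$e^{\frac{1}{2} (\Phi(x) + \Phi(y))^T J_{\hat\theta}^{-1} (\Phi(x) + \Phi(y))}$; it is then easy to check that the
normalized kernel

$$\tilde{k}(x, y) = \frac{k(x, y)}{\sqrt{k(x, x)\, k(y, y)}} = e^{-\frac{1}{2} (\Phi(x) - \Phi(y))^T J_{\hat\theta}^{-1} (\Phi(x) - \Phi(y))}$$

is equal to the Fisher kernel given in its form of Equation (6), up to the scaling
constant in the exponent. Cuturi and Vert (2005) propose an example of a mutual
information kernel defined on strings that can be computed exactly. In the latter
work the set of distributions $\{p_\theta, \theta \in \Theta\}$ is a set of Markov chain densities on
sequences with finite depths. The prior $\omega$ is a combination of branching process
priors for the structure of the chain and mixtures of Dirichlet priors for the
transition parameters. This setting yields closed computational formulas for the
kernel through previous work in universal coding (Willems et al., 1995;
Catoni, 2004). The computations can be carried out in a number of elementary
operations that is linear in the lengths of the inputs $x$ and $y$.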
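The Laplace argument can be checked numerically in one dimension. The sketch below assumes a scalar parameter, a Gaussian weight proportional to exp(-J u^2 / 2) with u the deviation from the fitted parameter, and arbitrary illustrative values for the scores `phi_x`, `phi_y` and the Fisher information `J`. The factor C p(x) p(y) is omitted because it cancels in the normalized kernel.

```python
import numpy as np

def laplace_integral(phi_x, phi_y, J, half_width=40.0, n_grid=200001):
    """Riemann sum of exp((phi_x + phi_y) u - 0.5 J u^2) over u in [-half_width, half_width]."""
    u = np.linspace(-half_width, half_width, n_grid)
    du = u[1] - u[0]
    integrand = np.exp((phi_x + phi_y) * u - 0.5 * J * u**2)
    return float(np.sum(integrand) * du)

def normalized_kernel(phi_x, phi_y, J):
    """k(x, y) / sqrt(k(x, x) k(y, y)) for the Gaussian-prior approximation."""
    kxy = laplace_integral(phi_x, phi_y, J)
    kxx = laplace_integral(phi_x, phi_x, J)
    kyy = laplace_integral(phi_y, phi_y, J)
    return kxy / np.sqrt(kxx * kyy)

phi_x, phi_y, J = 0.7, -0.4, 2.0      # hypothetical Fisher scores and information
numeric = normalized_kernel(phi_x, phi_y, J)
closed_form = float(np.exp(-0.5 * (phi_x - phi_y) ** 2 / J))
print(numeric, closed_form)
```

The two printed numbers should agree closely, which is the one-dimensional version of the claim that normalizing the approximated mutual information kernel yields a Gaussian kernel on Fisher scores.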



marginalized kernels: in the framework of sequence analysis first (Tsuda et al.,
2002b), and then in comparisons of graphs (Kashima et al., 2003), further
attention was given to latent variable models to define kernels in a way that also
generalized the Fisher kernel. In a latent variable model, the probability of
emission of an element $x$ is conditioned by an unobserved latent variable $s \in \mathcal{S}$,
where $\mathcal{S}$ is a finite space of possible states. When a string is considered in the
light of a hidden Markov model, to its chain $x = x_1 \cdots x_n$ of letters is associated
a similar sequence $s = s_1 \cdots s_n$ of states that is not usually observed. When the
sequence of states $s$ is known, the probability of $x$ under such a model is then
determined by the marginal probabilities $p(x_i | s_i)$. Building adequate transition
structures for the emitting states, and their corresponding emission
probabilities, is one of the goals of HMM estimation. The marginalized kernel assumes
that this sequence is not known for objects $x$ and $y$, but it performs, given an
available structure of states, an averaging

$$k(x, y) = \sum_{s \in \mathcal{S}} \sum_{s' \in \mathcal{S}} p(s | x)\, p(s' | y)\, \kappa\left((x, s), (y, s')\right)$$

of arbitrary kernel evaluations $\kappa$ weighted by posterior probabilities which are
estimated from data. In this setting, $\kappa$ can be any arbitrary kernel on $\mathcal{X} \times \mathcal{S}$.
For particular choices of $\kappa$ the kernel can be computed in closed form, both on
sequences and graphs (Mahé et al., 2004).
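The averaging performed by a marginalized kernel can be sketched on a toy latent variable model. The example below assumes a two-component Gaussian mixture on the real line instead of an HMM, and a joint kernel that multiplies a Gaussian kernel on the observations by an indicator that the two latent states agree. Every name and constant is hypothetical.

```python
import numpy as np

# Hypothetical latent variable model: two states, Gaussian emissions on the real line.
WEIGHTS = np.array([0.5, 0.5])
MEANS = np.array([-2.0, 2.0])
STD = 1.0

def posterior(x):
    """p(s | x) for each point (rows) and each latent state (columns)."""
    x = np.asarray(x, dtype=float)[:, None]
    log_joint = -0.5 * ((x - MEANS) / STD) ** 2 + np.log(WEIGHTS)
    log_joint -= log_joint.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(log_joint)
    return p / p.sum(axis=1, keepdims=True)

def joint_kernel(x, s, y, s_prime, width=1.0):
    """kappa((x, s), (y, s')): Gaussian kernel on x and y if the states agree, else 0."""
    return float(s == s_prime) * np.exp(-(x - y) ** 2 / (2 * width**2))

def marginalized_kernel(x, y):
    """Average of the joint kernel weighted by the posteriors p(s | x) p(s' | y)."""
    px, py = posterior(x), posterior(y)
    n_states = len(WEIGHTS)
    K = np.zeros((len(x), len(y)))
    for i, xi in enumerate(x):
        for j, yj in enumerate(y):
            K[i, j] = sum(px[i, s] * py[j, t] * joint_kernel(xi, s, yj, t)
                          for s in range(n_states) for t in range(n_states))
    return K

points = np.array([-2.5, -1.8, 0.1, 1.9, 2.4])
K = marginalized_kernel(points, points)
```

Points that are likely to share a latent state and are close to each other receive a larger kernel value than points assigned to different states.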



kernels defined on maximum-likelihood parameters: the previous approaches
make different uses of a statistical model $p_\theta(x)$. In mutual information
kernels $p_\theta(x)$ is treated as a feature indexed by a large set of parameters $\theta \in \Theta$.
For marginalized kernels an unseen, latent variable is added to $p_\theta$, giving $p_\theta(x, s)$, and
while $\theta$ is kept constant the integration of $p_\theta(x, s)\, p_\theta(y, s')$ is carried out over all possible
combinations of latent variables $(s, s')$. A third approach, conceptually simpler,
compares two objects by considering directly the parameters $\hat\theta_x$ and $\hat\theta_y$ that fit
them best respectively, that is, map first

$$(x, y) \mapsto (\hat\theta_x, \hat\theta_y) \in \Theta^2,$$

through maximum likelihood estimation for instance, and then compare $x$ and
$y$ through a kernel $k_\Theta$ on $\Theta$,

$$k(x, y) = k_\Theta(\hat\theta_x, \hat\theta_y).$$

Under this form, the topic of defining interesting functions $k_\Theta$ on $(\hat\theta_x, \hat\theta_y)$ is loosely
connected with information geometry (Amari and Nagaoka, 2001) and one may
use for simple densities some of the kernels presented in Section 3.1.2. For more
complex spaces of parameters $\Theta$ one may refer to Jebara et al. (2004) which
presents the family of kernels

$$k_\beta(x, y) = \int_{\mathcal{X}} p_{\hat\theta_x}(z)^\beta\, p_{\hat\theta_y}(z)^\beta\, dz$$

for $\beta > 0$, the case $\beta = \frac{1}{2}$ being the well known Bhattacharyya affinity between
densities. The authors review a large family of statistical models for which
these kernels can be computed in closed form, including graphical models,
Gaussian multivariate densities, multinomials and hidden Markov models.
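A small sketch of this third approach, under the assumption that each object is a sample of real numbers summarized by a univariate Gaussian fitted by maximum likelihood. It evaluates the integral of the product of densities raised to the power beta numerically and, for beta = 1/2, compares it with the standard closed form of the Bhattacharyya affinity between two Gaussians. Function names, grid bounds and samples are illustrative choices.

```python
import numpy as np

def fit_gaussian(sample):
    """Maximum likelihood estimate (mean, variance) of a univariate Gaussian."""
    sample = np.asarray(sample, dtype=float)
    return float(sample.mean()), float(sample.var())

def gaussian_pdf(z, mean, var):
    return np.exp(-0.5 * (z - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

def product_kernel(theta_x, theta_y, beta=0.5, lo=-50.0, hi=50.0, n_grid=400001):
    """Riemann sum approximating the integral of p_x(z)^beta p_y(z)^beta dz."""
    z = np.linspace(lo, hi, n_grid)
    dz = z[1] - z[0]
    integrand = gaussian_pdf(z, *theta_x) ** beta * gaussian_pdf(z, *theta_y) ** beta
    return float(np.sum(integrand) * dz)

def bhattacharyya_closed_form(theta_x, theta_y):
    """Closed form of the beta = 1/2 case for two univariate Gaussians."""
    (m1, v1), (m2, v2) = theta_x, theta_y
    return float(np.sqrt(2 * np.sqrt(v1 * v2) / (v1 + v2))
                 * np.exp(-0.25 * (m1 - m2) ** 2 / (v1 + v2)))

rng = np.random.default_rng(0)
theta_x = fit_gaussian(rng.normal(0.0, 1.0, size=200))   # object x: a sample
theta_y = fit_gaussian(rng.normal(1.0, 2.0, size=200))   # object y: another sample
k_numeric = product_kernel(theta_x, theta_y)
k_closed = bhattacharyya_closed_form(theta_x, theta_y)
print(k_numeric, k_closed)
```

For beta = 1/2 the kernel of an object with itself equals 1, since the integrand is then a probability density.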



information diffusion kernel: aiming also at computing kernels of interest
on multinomials, Lafferty and Lebanon (2005) propose to follow Kondor and Lafferty
(2002) and use diffusion processes to define kernels. To do so they express
solutions for the heat equation in the Riemannian manifold induced by the Fisher
metric of the considered statistical models, inspired again by information
geometry (Amari and Nagaoka, 2001). They derive information diffusion kernels out
of such solutions which, when specialized to multinomials, that is elements of
the simplex⁷, boil down to kernels of the form

$$k_{\Sigma_d}(\theta, \theta') = e^{-\frac{1}{t} \arccos^2\left(\sqrt{\theta} \cdot \sqrt{\theta'}\right)}, \qquad (9)$$

where $t > 0$ is the diffusion parameter. Note that the squared arc-cosine in
Equation (9) is the squared geodesic distance between $\theta$ and $\theta'$ seen as elements
from the unit sphere (that is when each $\theta_i$ is mapped to $\sqrt{\theta_i}$). Based on the
seminal work of Schoenberg (1942), Zhang et al. rather advocate the
direct use of the geodesic distance:

$$k_{\Sigma_d}(\theta, \theta') = e^{-\frac{1}{t} \arccos\left(\sqrt{\theta} \cdot \sqrt{\theta'}\right)}.$$

They prove that the geodesic distance is a negative definite kernel on the whole
sphere, while its square used in Equation (9) is not. If the points $\theta$ and $\theta'$ are
restricted to lie in the positive orthant, which is the case for multinomials, both
approaches yield however positive definite kernels.
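Both kernels on the simplex can be computed in a few lines. The sketch below assumes multinomial parameters stored as rows of a matrix and uses hypothetical function names; the clipping step is a numerical safeguard added here, not part of the formulas.

```python
import numpy as np

def sphere_inner(thetas):
    """Dot products sqrt(theta) . sqrt(theta') between points of the simplex."""
    roots = np.sqrt(thetas)
    # clipping guards arccos against values slightly above 1 due to rounding
    return np.clip(roots @ roots.T, 0.0, 1.0)

def diffusion_gram(thetas, t=1.0):
    """exp(-arccos^2(sqrt(theta) . sqrt(theta')) / t), squared geodesic distance."""
    return np.exp(-np.arccos(sphere_inner(thetas)) ** 2 / t)

def geodesic_gram(thetas, t=1.0):
    """exp(-arccos(sqrt(theta) . sqrt(theta')) / t), geodesic distance itself."""
    return np.exp(-np.arccos(sphere_inner(thetas)) / t)

rng = np.random.default_rng(0)
thetas = rng.dirichlet(np.ones(4), size=20)   # 20 hypothetical multinomials on 4 outcomes
K_diffusion = diffusion_gram(thetas)
K_geodesic = geodesic_gram(thetas)
print(float(np.linalg.eigvalsh(K_diffusion).min()),
      float(np.linalg.eigvalsh(K_geodesic).min()))
```

The printed smallest eigenvalues give a quick empirical look at positive definiteness on this particular sample of multinomials.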



3.3 Semigroup Kernels and Integral Representations 

Most positive definite kernels on groups, which include kernels on vectors of
$\mathbb{R}^n$ as described in Section 3.1.1, can be considered as semigroup kernels. A
semigroup is an algebraic structure that is simple enough to fit most datatypes
and rich enough to allow for a precise study of the kernels defined on them. Most
of the material of this section is taken from (Berg et al., 1984), but the interested
reader may consult the additional references (Devinatz, 1955; Ehm et al., 2003).
Let us start this section with the following definitions.



⁷ writing $\Sigma_d$ for the canonical simplex of dimension $d$, i.e., $\Sigma_d = \{\xi = (\xi_i)_{1 \leq i \leq d} : \xi_i \geq 0, \sum_i \xi_i = 1\}$.



semigroups: a semigroup $(S, +)$ is a nonempty set $S$ endowed with an
associative composition $+$ which admits a neutral element $0$, that is such that
$\forall x \in S$, $x + 0 = x$. An involutive semigroup $(S, +, *)$ is a semigroup endowed
with an involution $*$ which is a mapping $S \to S$ such that for any $x$ in $S$,
$(x^*)^* = x$. Let us provide some examples of semigroups:

• $S$ is the set of strings formed with letters from a given alphabet, $+$ is the
concatenation operation, $0$ is the empty string and $*$ is either the identity
or the operation which reverses the order of the letters of a string.

• $S$ is a group, and $*$ is the inverse operation of the group. $(\mathbb{R}, +, -)$ is a
typical example.

• $S$ is the positive orthant $\mathbb{R}_+$ endowed with the usual addition and $*$ is the
identity.

Note that most semigroups considered in the machine learning literature are
abelian, that is, the operation $+$ is commutative.



semigroup kernels: a semigroup kernel is a kernel $k$ defined through a
complex-valued function $\varphi$ defined on $S$ such that

$$k(x, y) \overset{\text{def}}{=} \varphi(x + y^*).$$

A function $\varphi$ is a positive definite function if the kernel that can be derived from
it as $\varphi(x + y^*)$ is itself positive definite. When $S$ is a vector space, and hence
a group, for two elements $x, y$ of $S$ one can easily check that most elementary
kernels are either defined as

$$k(x, y) = \varphi(x - y),$$

or

$$k(x, y) = \varphi(x + y), \qquad (10)$$

respectively when $*$ is the minus operation and $*$ is the identity. Kernels built
on the former structure will typically emphasize the difference between two
elements, given this difference can be computed, and include as their most
important example radial basis functions (RBF) and the Gaussian kernel. When a
subtraction between elements cannot be defined, as is the case with strings,
histograms and nonnegative measures, the form of Equation (10) is better suited, as
can be seen in some of the examples of Section 3.1.2 and studied in (Cuturi et al.,
2005). In this work, the authors study a family of kernels for probability
measures $\mu$ and $\mu'$ by looking at their average $(\mu + \mu')/2$. They narrow down their
study to kernels defined through the variance matrix $\Sigma\left(\frac{\mu + \mu'}{2}\right)$ of their average,
and show that

$$k(\mu, \mu') \overset{\text{def}}{=} \frac{1}{\sqrt{\det \Sigma\left(\frac{\mu + \mu'}{2}\right)}}$$

is a positive definite kernel between the two measures. This result can be further
extended through reproducing kernel Hilbert space theory, yielding a kernel
between two clouds of points $\{x_1, \dots, x_n\}$ and $\{y_1, \dots, y_m\}$ which only depends
on the kernel similarity matrices $K_{XY} = [\kappa(x_i, y_j)]$, $K_X = [\kappa(x_i, x_j)]$ and
$K_Y = [\kappa(y_i, y_j)]$, where $\kappa$ denotes a kernel on the space of points.
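The two elementary forms of semigroup kernels on vectors, one depending on the difference and one on the sum of its arguments, can be illustrated as follows. The functions `gaussian_phi` and `inverse_phi` are example choices made here: the first gives the Gaussian kernel on the real line, the second, 1 / (1 + u), is the Laplace transform of the density exp(-r) on the positive half-line and is therefore used on the positive orthant only.

```python
import numpy as np

def difference_gram(x, phi):
    """k(x, y) = phi(x - y): the involution * is the minus operation."""
    return phi(x[:, None] - x[None, :])

def sum_gram(x, phi):
    """k(x, y) = phi(x + y): the involution * is the identity."""
    return phi(x[:, None] + x[None, :])

def gaussian_phi(u):
    return np.exp(-u**2 / 2)

def inverse_phi(u):
    return 1.0 / (1.0 + u)

rng = np.random.default_rng(0)
reals = rng.normal(size=12)                    # points of the group (R, +, -)
positives = rng.uniform(0.0, 5.0, size=12)     # points of the semigroup (R_+, +, identity)
K_diff = difference_gram(reals, gaussian_phi)
K_sum = sum_gram(positives, inverse_phi)
print(float(np.linalg.eigvalsh(K_diff).min()), float(np.linalg.eigvalsh(K_sum).min()))
```

Note that the second kernel is not translation invariant: its diagonal entries phi(2x) vary with x, unlike those of a kernel of the form phi(x - y).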



integral representations: semigroup kernels can be expressed as sums of
semicharacters, a family of elementary functions on $S$. A real-valued function
$\rho$ on an Abelian semigroup $(S, +)$ is called a semicharacter if it satisfies

(i) $\rho(0) = 1$,

(ii) $\forall s, t \in S$, $\rho(s + t) = \rho(s)\rho(t)$,

(iii) $\forall s \in S$, $\rho(s) = \rho(s^*)$.

The set of semicharacters defined on $S$ is written $S^*$ while the set of bounded
semicharacters can be written as $\hat{S}$. It is trivial to see that every semicharacter
is itself a positive definite function. The converse is obviously not true, but it
is possible to show that bounded semicharacters are the extremal points of the
cone of bounded positive definite functions, therefore providing the following
result given by Berg et al. (1984):

Theorem 4 (Integral representation of p.d. functions) A bounded function
$\varphi : S \to \mathbb{R}$ is p.d. if and only if there exists a non-negative measure $\omega$
on $\hat{S}$ such that:

$$\varphi(s) = \int_{\hat{S}} \rho(s)\, d\omega(\rho).$$

In that case the measure $\omega$ is unique.

When $S$ is the Euclidean space $\mathbb{R}^d$ the following results, due originally to
Bochner and Bernstein respectively, allow us to characterize kernels for two
vectors $x$ and $y$ that depend respectively on $(x - y)$ and $(x + y)$.

opposite involution: let a continuous kernel $k$ be such that $k(x, y) = \varphi(x - y)$, that is,
$*$ is the minus operation. Then there exists a unique non-negative measure $\omega$ on $\mathbb{R}^d$ such that

$$\varphi(x) = \int_{\mathbb{R}^d} e^{i x^T r}\, d\omega(r);$$

in other words, $\varphi$ is the Fourier transform of a non-negative measure $\omega$ on $\mathbb{R}^d$.

identical involution: let a bounded kernel $k$ be such that $k(x, y) = \varphi(x + y)$, that is,
$*$ is the identity. Then there exists a unique non-negative measure $\omega$ on $\mathbb{R}^d$ such that

$$\varphi(x) = \int_{\mathbb{R}^d} e^{-x^T r}\, d\omega(r);$$

or in other words $\varphi$ is the Laplace transform of a non-negative measure $\omega$ on
$\mathbb{R}^d$.



4 Kernel Machines



Kernel machines are algorithms that select functions with desirable properties
in a pre-defined reproducing kernel Hilbert space (rkHs) given sample data. All
kernel estimation procedures first define a criterion that is a combination of
possibly numerous and different properties. Subsequently, the element $f$ of the
rkHs that is the optimum with respect to this criterion is selected following an
optimization procedure. Before presenting such algorithms, let us mention an
important theoretical challenge that appears when dealing with the estimation
of functions in a rkHs.

Let $\mathcal{X}$ be a set endowed with a kernel $k$ and $\mathcal{H}_k$ its corresponding rkHs.
Choosing a function in an infinite-dimensional space such as $\mathcal{H}_k$ can become an
ill-posed problem when the criterion used to select the function does not have a
unique minimizer. The representer theorem formulated below provides a
practical answer to this problem when a regularization term is used along with a
convex objective.



4.1 The Representer Theorem 

Most estimation procedures presented in the statistical literature to perform
dimensionality reduction or infer a decision function out of sampled points rely
on the optimization of a criterion which is usually carried out over a class of
linear functionals of the original data. Indeed, PCA, CCA, logistic regression
and least-squares regression and its variants (lasso or ridge regression) all look
for linear transformations of the original data points to address the learning
task. When these optimizations are carried out instead on an infinite-dimensional
space of functions, namely in the rkHs $\mathcal{H}_k$, the optimization can be performed in
finite-dimensional subspaces of $\mathcal{H}_k$ if the criterion only depends on the values of
the function on a finite sample of points and on its norm. This result is known as
the representer theorem and explains why so many linear algorithms can be
"kernelized" when trained on finite datasets.
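The reason a finite-dimensional subspace suffices can be sketched with an orthogonal decomposition. The notation $f_\parallel$, $f_\perp$ and $\mathcal{H}_n$ is introduced here for illustration, and the sketch assumes a criterion that depends on $f$ only through the values $f(x_i)$ and through a norm term in which it is strictly increasing.

```latex
\begin{align*}
f &= f_\parallel + f_\perp,
  \qquad f_\parallel \in \mathcal{H}_n = \operatorname{span}\{k(x_i, \cdot),\ 1 \leq i \leq n\},
  \quad f_\perp \perp \mathcal{H}_n, \\
f(x_i) &= \langle f, k(x_i, \cdot) \rangle_{\mathcal{H}_k}
        = \langle f_\parallel, k(x_i, \cdot) \rangle_{\mathcal{H}_k}
        = f_\parallel(x_i), \\
\|f\|_{\mathcal{H}_k}^2 &= \|f_\parallel\|_{\mathcal{H}_k}^2 + \|f_\perp\|_{\mathcal{H}_k}^2
        \;\geq\; \|f_\parallel\|_{\mathcal{H}_k}^2 .
\end{align*}
```

Replacing $f$ by $f_\parallel$ thus leaves every evaluation on the sample unchanged and can only decrease the norm, so a minimizer must satisfy $f_\perp = 0$.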



Theorem 5 (Representer Theorem (Kimeldorf and Wahba, 1971)) Let
$\mathcal{X}$ be a set endowed with a kernel $k$ and $\mathcal{H}_k$ its corresponding rkHs. Let $\{x_i\}_{1 \leq i \leq n}$
be a finite set of points of $\mathcal{X}$ and let $\Psi : \mathbb{R}^{n+1} \to \mathbb{R}$ be any function that is strictly
increasing with respect to its last argument. Then any solution to the problem

$$\min_{f \in \mathcal{H}_k} \Psi\left(f(x_1), \dots, f(x_n), \|f\|_{\mathcal{H}_k}\right)$$

is in the finite dimensional subspace $\operatorname{span}\{k(x_i, \cdot), 1 \leq i \leq n\}$ of $\mathcal{H}_k$.

The theorem in its original form was cast in a more particular setting, where
the term $\|f\|_{\mathcal{H}_k}$ would be simply added to an empirical risk, as often used in
Section 4.3. This generalized version is however important to deal with an
unsupervised setting.



4.2 Eigenfunctions in a rkHs of Sample Data Points



In machine learning, unsupervised learning is a class of problems in which one
seeks to determine how the data are organized. Indeed, for some applications
practitioners are interested first in summarizing the information contained in
their observations rather than inferring a decision function on such data. This
task can be broadly categorized as dimensionality reduction and can be seen as a
data-dependent way to summarize the information contained in each datapoint
in a few numbers. Namely, given a sample $X = \{x_1, \dots, x_n\}$ of points of $\mathcal{X}$,
the goal is to translate such a set of points into an alternative representation
$\xi = \{\xi_1, \dots, \xi_n\}$ where each $\xi_j$ is in $\mathbb{R}^d$ and $d$ is much lower than the
dimensionality of $\mathcal{X}$.

If both $\mathcal{X}$ and $\mathcal{Y}$ are Euclidean spaces, two popular unsupervised linear
techniques are of particular interest.



Principal component analysis (PCA) which aims at defining an orthonormal
basis $v_1, \dots, v_{\dim(\mathcal{X})}$ of $\mathcal{X}$ such that for $1 \leq j \leq \dim(\mathcal{X})$,

$$v_j = \underset{v \in \mathcal{X},\ \|v\|_{\mathcal{X}} = 1,\ v \perp \{v_1, \dots, v_{j-1}\}}{\operatorname{argmax}} \operatorname{var}_X\left[\langle v, x \rangle_{\mathcal{X}}\right], \qquad (11)$$

where for any function $f : \mathcal{X} \to \mathbb{R}$, $\operatorname{var}_X[f]$ denotes the empirical variance with
respect to the points enumerated in $X$, that is $E_X[(f - E_X[f])^2]$. The first $r$
eigenvectors $v_1, \dots, v_r$ are of particular interest since the $r$-dimensional projection of $x$,
$(v_1^T x, \dots, v_r^T x)$, usually suffices to capture most of the variability of the data
under a Gaussian assumption.



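Canonical correlation analysis (CCA) can be applied when a set of
measurements from a sample $X$ can be paired with another set of observations
$Y = \{y_i\}_{1 \leq i \leq n}$ taken in a set $\mathcal{Y}$, and when the pairs $(x_i, y_i)$ are drawn from
an i.i.d. law. Such tasks appear typically when each index $i$ refers to the same
underlying object cast in different modalities (Vert and Kanehisa, 2003). CCA
looks for meaningful relationships between $X$ and $Y$ by focusing on linear
projections of $X$ and $Y$, $\alpha^T X$ and $\beta^T Y$, such that the correlation between $\alpha^T X$
and $\beta^T Y$ is high. In mathematical terms this amounts to defining

$$(\alpha, \beta) = \underset{\alpha \in \mathcal{X},\ \beta \in \mathcal{Y}}{\operatorname{argmax}} \operatorname{corr}_{X,Y}[\alpha^T, \beta^T]
= \underset{\alpha \in \mathcal{X},\ \beta \in \mathcal{Y}}{\operatorname{argmax}} \frac{\operatorname{cov}_{X,Y}[\alpha^T, \beta^T]}{\sqrt{\operatorname{var}_X[\alpha^T]\, \operatorname{var}_Y[\beta^T]}}, \qquad (12)$$

where $\alpha^T$ and $\beta^T$ stand for the linear functions $x \mapsto \alpha^T x$ and $y \mapsto \beta^T y$, and
where for two real-valued functions $f : \mathcal{X} \to \mathbb{R}$ and $g : \mathcal{Y} \to \mathbb{R}$ we write

$$\begin{aligned}
\operatorname{var}_X[f] &= E_X\left[(f(x) - E_X[f(x)])^2\right], \\
\operatorname{var}_Y[g] &= E_Y\left[(g(y) - E_Y[g(y)])^2\right], \\
\operatorname{cov}_{X,Y}[f, g] &= E_{X,Y}\left[(f(x) - E_X[f(x)])\,(g(y) - E_Y[g(y)])\right].
\end{aligned}$$

We observe that both optimizations look for vectors in $\mathcal{X}$ (as well as in $\mathcal{Y}$ in
the case of CCA) that will be representative of the data dependencies. The
three operators $\operatorname{var}_X$, $\operatorname{var}_Y$ and $\operatorname{cov}_{X,Y}$ can be approximated by finite sample
estimators, respectively

$$\begin{aligned}
\operatorname{var}^n_X[f] &= \frac{1}{n} \sum_{i=1}^n \Big( f(x_i) - \frac{1}{n} \sum_{j=1}^n f(x_j) \Big)^2, \\
\operatorname{var}^n_Y[g] &= \frac{1}{n} \sum_{i=1}^n \Big( g(y_i) - \frac{1}{n} \sum_{j=1}^n g(y_j) \Big)^2, \\
\operatorname{cov}^n_{X,Y}[f, g] &= \frac{1}{n} \sum_{i=1}^n \Big( f(x_i) - \frac{1}{n} \sum_{j=1}^n f(x_j) \Big) \Big( g(y_i) - \frac{1}{n} \sum_{j=1}^n g(y_j) \Big).
\end{aligned}$$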

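Generalization to functions in a rkHs: The "kernelization" of such
algorithms is natural when considering the same criteria on the mappings in $\mathcal{H}$
of the random variables $X$ and $Y$. We write for convenience $\mathcal{H}_{\mathcal{X}}$ and $\mathcal{H}_{\mathcal{Y}}$ for
the rkHs associated with $\mathcal{X}$ and $\mathcal{Y}$ with respective kernels $k_{\mathcal{X}}$ and $k_{\mathcal{Y}}$. If we
now cast the problem as that of estimating a function $f$ in $\mathcal{H}_{\mathcal{X}}$, and a couple of
functions $(f, g)$ in $\mathcal{H}_{\mathcal{X}}$ and $\mathcal{H}_{\mathcal{Y}}$ respectively, we are now looking for vectors in
such spaces, that is real-valued functions on $\mathcal{X}$, and on $\mathcal{X}$ and $\mathcal{Y}$ respectively, that
are directions of interest in the sense that they have adequate values according
to the criteria defined in Equations (11) and (12). When considered on the
finite subspaces of $\mathcal{H}_k$ spanned by the datapoints, the two previous optimizations
become

$$f_j = \underset{f \in \mathcal{H}_{\mathcal{X}},\ \|f\|_{\mathcal{H}_{\mathcal{X}}} = 1,\ f \perp \{f_1, \dots, f_{j-1}\}}{\operatorname{argmax}} \operatorname{var}^n_X\left[\langle f, k_{\mathcal{X}}(x, \cdot) \rangle_{\mathcal{H}_{\mathcal{X}}}\right],$$

for $1 \leq j \leq n$ and

$$(f, g) = \underset{f \in \mathcal{H}_{\mathcal{X}},\ g \in \mathcal{H}_{\mathcal{Y}}}{\operatorname{argmax}} \frac{\operatorname{cov}^n_{X,Y}\left[\langle f, k_{\mathcal{X}}(x, \cdot) \rangle_{\mathcal{H}_{\mathcal{X}}}, \langle g, k_{\mathcal{Y}}(y, \cdot) \rangle_{\mathcal{H}_{\mathcal{Y}}}\right]}{\sqrt{\operatorname{var}^n_X\left[\langle f, k_{\mathcal{X}}(x, \cdot) \rangle_{\mathcal{H}_{\mathcal{X}}}\right]\, \operatorname{var}^n_Y\left[\langle g, k_{\mathcal{Y}}(y, \cdot) \rangle_{\mathcal{H}_{\mathcal{Y}}}\right]}}. \qquad (13)$$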



kernel PCA: the first problem has been termed kernel-PCA by Schölkopf et al.
(1998) and boils down to the decomposition of the operator $\operatorname{var}^n_X$ into $n$
eigenfunctions⁸. This decomposition can be carried out by considering the $n \times n$
kernel matrix $K_X$ of the $n$ observations, or more precisely its centered
counterpart

$$\bar{K}_X = \left(I_n - \tfrac{1}{n} \mathbb{1}_{n,n}\right) K_X \left(I_n - \tfrac{1}{n} \mathbb{1}_{n,n}\right).$$

⁸ Note that kernelizing weighted PCA is not as straightforward and can only be carried out
through a more generalized eigendecomposition, as briefly formulated in (Cuturi and Vert,
2005).

The eigenfunctions $f_i$ can be recovered by considering the eigenvalue/eigenvector
pairs $(e_i, d_i)$ of $\bar{K}_X$, that is such that

$$\bar{K}_X = E D E^T$$

where $D = \operatorname{diag}(d)$ and $E$ is an orthogonal matrix. Writing $U = E D^{-1/2}$ we
have that

$$f_j(\cdot) = \sum_{i=1}^n U_{i,j}\, k_{\mathcal{X}}(x_i, \cdot) \qquad (14)$$

with $\operatorname{var}^n_X[f_j(x)] = \frac{d_j}{n}$.
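The kernel PCA recipe (center the Gram matrix, diagonalize it, rescale the eigenvectors by the inverse square root of the eigenvalues) can be sketched as follows. The Gaussian kernel, the random sample and the number `r` of retained components are assumptions made for illustration.

```python
import numpy as np

def gaussian_gram(X, sigma=1.0):
    """Gaussian kernel matrix of the rows of X."""
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0.0)
    return np.exp(-d2 / (2 * sigma**2))

def center_gram(K):
    """(I - 1/n 11^T) K (I - 1/n 11^T)."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def kernel_pca(K, r):
    """Return U (n x r), the top r eigenvalues d, and the centered values of each f_j at the sample points."""
    K_bar = center_gram(K)
    d, E = np.linalg.eigh(K_bar)          # eigenvalues in ascending order
    order = np.argsort(d)[::-1][:r]       # keep the r largest
    d, E = d[order], E[:, order]
    U = E / np.sqrt(d)                    # U = E D^{-1/2}, column by column
    projections = K_bar @ U               # column j: centered values of f_j on the sample
    return U, d, projections

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
K = gaussian_gram(X)
U, d, Z = kernel_pca(K, r=3)
print(Z.var(axis=0), d / X.shape[0])
```

The two printed vectors should coincide: the empirical variance of the j-th eigenfunction on the sample is the j-th eigenvalue of the centered Gram matrix divided by n.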

kernel CCA: the second optimization, first coined kernel-CCA by Akaho
(2001), is ill-posed if Equation (13) is used directly with a finite sample, and
requires a regularization as explained in (Bach and Jordan, 2002; Fukumizu et al.,
2007). Namely, the direct maximization

$$(f, g) = \underset{f \in \mathcal{H}_{\mathcal{X}},\ g \in \mathcal{H}_{\mathcal{Y}}}{\operatorname{argmax}} \frac{\operatorname{cov}^n_{X,Y}[f, g]}{\sqrt{\operatorname{var}^n_X[f]\, \operatorname{var}^n_Y[g]}}$$

is likely to result in degenerate directions where $\operatorname{var}^n_X[f]$ or $\operatorname{var}^n_Y[g]$ is close to
zero, which suffices to maximize the ratio above. Instead, the criterion below,

$$(f, g) = \underset{f \in \mathcal{H}_{\mathcal{X}},\ g \in \mathcal{H}_{\mathcal{Y}}}{\operatorname{argmax}} \frac{\operatorname{cov}^n_{X,Y}[f, g]}{\sqrt{\left(\operatorname{var}^n_X[f] + \lambda \|f\|^2\right) \left(\operatorname{var}^n_Y[g] + \lambda \|g\|^2\right)}},$$

is known to converge to a meaningful solution when $\lambda$ decreases to zero as $n$
grows with the proper convergence speed (Fukumizu et al., 2007). The finite
sample estimates $f^n$ and $g^n$ can be recovered as

$$f^n(\cdot) = \sum_{i=1}^n \xi_i\, \varphi_i(\cdot), \qquad g^n(\cdot) = \sum_{i=1}^n \zeta_i\, \psi_i(\cdot),$$

where $\xi$ and $\zeta$ are the solutions of

$$(\xi, \zeta) = \underset{\xi^T (\bar{K}_X^2 + n\lambda \bar{K}_X) \xi \,=\, \zeta^T (\bar{K}_Y^2 + n\lambda \bar{K}_Y) \zeta \,=\, 1}{\operatorname{argmax}} \zeta^T \bar{K}_Y \bar{K}_X\, \xi$$

and

$$\varphi_i(\cdot) = k_{\mathcal{X}}(x_i, \cdot) - \frac{1}{n} \sum_{j=1}^n k_{\mathcal{X}}(x_j, \cdot), \qquad \psi_i(\cdot) = k_{\mathcal{Y}}(y_i, \cdot) - \frac{1}{n} \sum_{j=1}^n k_{\mathcal{Y}}(y_j, \cdot)$$

are the centered projections of $(x_i)$ and $(y_i)$ in $\mathcal{H}_{\mathcal{X}}$ and $\mathcal{H}_{\mathcal{Y}}$ respectively. The
topic of supervised dimensionality reduction, explored in (Fukumizu et al., 2004),
is also linked to the kernel-CCA approach. The authors look for a sparse
representation of the data that will select an effective subspace for $\mathcal{X}$ and delete
all directions in $\mathcal{X}$ that are not correlated to paired observations in $\mathcal{Y}$, based
on two samples $X$ and $Y$. In linear terms, such a sparse representation can be
described as a projection of the points of $\mathcal{X}$ into a subspace of lower dimension
while conserving the correlations observed with corresponding points in $\mathcal{Y}$.
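A regularized kernel CCA can be sketched with a singular value decomposition. The code below is one possible simplification, not the procedure of the cited papers: it assumes that the penalized variance term, a quadratic form in the centered Gram matrix squared plus n lambda times the centered Gram matrix, is approximated by the square of the centered Gram matrix shifted by (n lambda / 2) times the identity. The data, kernel width and value of `lam` are illustrative.

```python
import numpy as np

def gaussian_gram(x, sigma=1.0):
    """Gaussian kernel matrix of a one-dimensional sample."""
    d2 = (x[:, None] - x[None, :]) ** 2
    return np.exp(-d2 / (2 * sigma**2))

def center_gram(K):
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def kernel_cca(Kx, Ky, lam=0.1):
    """First regularized canonical correlation and coefficient vectors (xi, zeta)."""
    n = Kx.shape[0]
    Kx_bar, Ky_bar = center_gram(Kx), center_gram(Ky)
    Rx = Kx_bar + 0.5 * n * lam * np.eye(n)   # assumed approximation of the penalized variance
    Ry = Ky_bar + 0.5 * n * lam * np.eye(n)
    # With u = Rx xi and v = Ry zeta of unit norm, the criterion is u^T M v
    M = np.linalg.solve(Rx, Kx_bar @ Ky_bar) @ np.linalg.inv(Ry)
    A, s, Bt = np.linalg.svd(M)
    xi = np.linalg.solve(Rx, A[:, 0])
    zeta = np.linalg.solve(Ry, Bt[0])
    return float(s[0]), xi, zeta

rng = np.random.default_rng(0)
t = rng.uniform(-1.0, 1.0, size=60)             # hidden variable shared by both views
x = t + 0.05 * rng.normal(size=60)
y = t**2 + 0.05 * rng.normal(size=60)           # nonlinear dependence on the same variable
Kx, Ky = gaussian_gram(x), gaussian_gram(y)
rho, xi, zeta = kernel_cca(Kx, Ky, lam=0.1)
print(rho)
```

Because of the regularization, the returned value always lies strictly below 1; without it, the finite-sample problem would typically reach a correlation of 1 in degenerate directions.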

4.3 Regression, Classification and other Supervised Tasks 

Suppose that we wish to infer now, from what is observed in the samples $X$ and
$Y$, a causal relation between all the points of $\mathcal{X}$ and $\mathcal{Y}$. This type of inference
is usually restricted to finding a mapping $f$ from $\mathcal{X}$ to $\mathcal{Y}$ that is consistent
with the collected data and has desirable smoothness properties so that it
appears as a "natural" decision function seen from a prior perspective. If $\mathcal{X}$ is
Euclidean and $\mathcal{Y}$ is $\mathbb{R}$, the latter approach is a well studied field of mathematics
known as approximation theory, rooted a few centuries ago in polynomial
interpolation of given pairs of points, and developed in statistics through spline
regression (Wahba, 1990) and basis expansions (Hastie et al., 2001, §5).



empirical risk minimization: statistical learning theory starts its course
when a probabilistic knowledge about the generation of the points $(x, y)$ is
assumed, and the reader may refer to (Cucker and Smale, 2002) for a valuable
review. We skip its rigorous exposition, and favour intuitive arguments next.
A sound guess for the learning rule $f$ would be a function with a low empirical
risk,

$$R_c^{emp}(f) \overset{\text{def}}{=} \frac{1}{n} \sum_{i=1}^n c(f(x_i), y_i),$$

quantified by a cost function $c : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$ that penalizes wrong predictions
and which is nil on the diagonal. Minimizing $R_c^{emp}$ directly given training sets
$X$ and $Y$ is however unlikely to give interesting functions for $f$. If the function
class $\mathcal{F}$ from which $f$ is selected is large, the problem becomes ill-posed in the
sense that many solutions to the minimization exist, of which few will prove
useful in practice. On the contrary, if the function class is too restricted, there
will be no good minimizer of the empirical risk that may serve in practice. To
take that tradeoff into account, and rather than constraining $\mathcal{F}$, assume that
$J : \mathcal{F} \to \mathbb{R}$ is a function that quantifies the roughness of a function and which is
used to penalize the empirical risk,

$$R_c^\lambda(f) = \frac{1}{n} \sum_{i=1}^n c(f(x_i), y_i) + \lambda J(f). \qquad (15)$$

Here $\lambda > 0$ balances the tradeoff between two desired properties for the function
$f$, that is a good fit for the data at hand and a smoothness as measured by $J$.
This formulation is used in most regression and classification settings to select
a good function $f$ as the minimizer of

$$\hat{f} = \underset{f \in \mathcal{F}}{\operatorname{argmin}}\ R_c^\lambda(f). \qquad (16)$$



kernel classifiers and regressors: we recover through the formulation of
Equation (15) a large variety of methods, notably when the penalization is
directly related to the norm of the function in a rkHs:

• When $\mathcal{X}$ is Euclidean and $\mathcal{Y} = \mathbb{R}$, $\mathcal{F} = \mathcal{X}^*$, the dual of $\mathcal{X}$, and $c(f(x), y) =
(y - f(x))^2$, minimizing $R_c^\lambda$ is known as least-squares regression when $\lambda = 0$;
ridge regression (Hoerl, 1962) when $\lambda > 0$ and $J$ is the squared Euclidean 2-norm;
the lasso (Tibshirani, 1996) when $\lambda > 0$ and $J$ is the 1-norm.

• When $\mathcal{X} = [0, 1]$, $\mathcal{Y} = \mathbb{R}$, $\mathcal{F}$ is the space of $m$-times differentiable functions
on $[0, 1]$ and $J = \int_{[0,1]} \left(f^{(m)}(t)\right)^2 dt$, we obtain regression by natural splines
of order $m$. This setting actually corresponds to the usage of thin-plate
splines which can also be regarded as a rkHs type method (Wahba, 1990);
see (Girosi et al., 1995, Table 3) for other examples.

• When $\mathcal{X}$ is an arbitrary set endowed with a kernel $k$ and $\mathcal{Y} = \{-1, 1\}$,
$\mathcal{F} = \mathcal{H}_k$, $J = \|\cdot\|_{\mathcal{H}_k}$ and the hinge loss $c(f(x), y) = (1 - y f(x))_+$ is
used, we obtain the support vector machine (Cortes and Vapnik, 1995).
Using the cost function $c(f(x), y) = \ln(1 + e^{-y f(x)})$ yields an extension
of logistic regression known as kernel logistic regression (Zhu and Hastie,
2002).

• When $\mathcal{X}$ is an arbitrary set endowed with a kernel $k$ and $\mathcal{Y} = \mathbb{R}$, $\mathcal{F} =
\mathcal{H}_k$, $J = \|\cdot\|_{\mathcal{H}_k}$ and $c(f(x), y) = (|y - f(x)| - \varepsilon)_+$, the $\varepsilon$-insensitive
loss function, the solution to this program is known as support vector
regression (Drucker et al., 1997).
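As one concrete instance of the penalized risk, the sketch below takes the squared loss with a penalty equal to the squared rkHs norm (an assumption made for this example) and searches for the minimizer among kernel expansions over the training points. The minimization then reduces to a linear system in the coefficients. The Gaussian kernel, its width and the value of `LAM` are illustrative.

```python
import numpy as np

LAM = 1e-3   # hypothetical regularization weight

def gaussian_gram(a, b, sigma=0.5):
    """Gaussian kernel between two one-dimensional samples."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return np.exp(-d2 / (2 * sigma**2))

def fit_kernel_ridge(x, y, lam=LAM):
    """Minimize (1/n) sum_i (y_i - f(x_i))^2 + lam ||f||^2 over f = sum_i a_i k(x_i, .)."""
    n = len(x)
    K = gaussian_gram(x, x)
    # setting the gradient to zero gives (K + n lam I) a = y
    return np.linalg.solve(K + n * lam * np.eye(n), y)

def predict(x_train, alpha, x_new):
    """Evaluate the expansion sum_i a_i k(x_i, x_new)."""
    return gaussian_gram(x_new, x_train) @ alpha

rng = np.random.default_rng(0)
x = np.linspace(0.0, 2 * np.pi, 40)
y = np.sin(x) + 0.1 * rng.normal(size=x.size)
alpha = fit_kernel_ridge(x, y)
fitted = predict(x, alpha, x)
print(float(np.mean(np.abs(fitted - np.sin(x)))))
```

The same pattern, with a different loss, leads to a convex program in the coefficients instead of a linear system, which is the case for the support vector machine and kernel logistic regression.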



Note that by virtue of the representer theorem, recalled above as Theo- 
rem [51 that whenever T is set to be a rkHs H, the mathematical program of 
Equation (| 16[) reaches its minima in the subspace H. n spanned by the kernel 
functionals evaluated on the sample points, that is 

/ G span/c(x i , ■), 

hence the function / in Equation (|15p can be explicitly replaced by a finite 
expansion 



/ = ^2a l k(x l , •), 



(17) 



i=i 



and the corresponding set of feasible solutions / G TL by / € 7i n and more simply 
a 6 R " using Equation (fl7)) . The reader may consult (jSteinwart and Christmann 
20081) for an exhaustive treatment. 
kernel graph inference: we quote another example of a supervised rkHs
method. In the context of supervised graph inference, Vert and Yamanishi
(2005) consider a set of connected points $\{x_i\}_{1 \le i \le n}$ whose connections are
summarized in the combinatorial Laplacian matrix $L$ of their graph, that is for $i \neq j$,
$L_{ij} = -1$ if $i$ and $j$ are connected and 0 otherwise, and $L_{ii} = -\sum_{j \neq i} L_{ij}$. The
authors look for a sequence of functions $\{f_i\}_{1 \le i \le d}$ of an rkHs $\mathcal{H}_k$ to map the
original points in $\mathbb{R}^d$, and hope to recover the structure of the original graph
through this representation. Namely, the projection is optimized such that the
points, once projected in $\mathbb{R}^d$, will have graph interactions in that metric (that
is by linking all nearest neighbours up to some distance threshold) that will be
consistent with the original interactions. This leads to successive minimizations
that may recall those performed in kernel-PCA, although different in nature
through the addition of a regularization term proportional to $\lambda$:

$$ f_j = \underset{f \in \mathcal{H}_k,\ f \perp \{f_1, \dots, f_{j-1}\}}{\mathrm{argmin}} \ \frac{f_X^T L f_X + \lambda \|f\|^2_{\mathcal{H}_k}}{f_X^T f_X}, $$

where the vector $f_X$ is defined as

$$ f_X = (f(x_1), \dots, f(x_n))^T. $$

The term $f_X^T L f_X$ above can be interpreted as a cost function with respect to the
observable graph $L$, which penalizes functions $f$ for which the values
of $f(x_i)$ and $f(x_j)$ are very different for two connected nodes.
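
The interpretation of the Laplacian term as a cost rests on the identity $f_X^T L f_X = \sum_{i \sim j} (f(x_i) - f(x_j))^2$, where the sum runs over connected pairs. The short sketch below checks it numerically on a toy graph; the helper names are ours.

```python
import numpy as np

def combinatorial_laplacian(n, edges):
    """Laplacian with L_ij = -1 for connected i != j and L_ii = -sum_{j != i} L_ij."""
    L = np.zeros((n, n))
    for i, j in edges:
        L[i, j] = L[j, i] = -1.0
    # The diagonal is still zero here, so the row sums only involve off-diagonal terms.
    L[np.diag_indices(n)] = -L.sum(axis=1)
    return L

def graph_cost(L, f_X):
    """Quadratic form f_X^T L f_X, equal to the sum over edges of (f_i - f_j)^2."""
    return float(f_X @ L @ f_X)
```

A function that is constant on the graph has zero cost, while a function that changes abruptly across an edge is penalized by the squared jump.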

kernel discriminant analysis: we recall briefly the ideas behind the Fisher
linear discriminant (Fisher, 1936) for classification. Given a sample $X = (x_1, \dots, x_n)$
of points in $\mathbb{R}^d$, assume that to each point $x_i$ corresponds a binary variable
$y_i \in \{0, 1\}$ which is equal to 0 if $x_i$ belongs to a first class and 1 when $x_i$ belongs
to a second class. Fisher discriminant analysis (LDA) assumes that the conditional
distributions $p_0(X) = p(X | Y = 0)$ and $p_1(X) = p(X | Y = 1)$ are both
normal densities. If the means and variances $\mu_0, \mu_1$ and $\Sigma_0, \Sigma_1$ of $p_0$ and $p_1$
respectively were known, the Bayes optimal rule would be to classify any
observation $x$ according to the value of its probability ratio $p_1(x) / p_0(x)$ and predict it is
in class 0 whenever that ratio is below a certain threshold, that is whenever

$$ (x - \mu_0)^T \Sigma_0^{-1} (x - \mu_0) + \ln|\Sigma_0| - (x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1) - \ln|\Sigma_1| < \beta. $$

If the two classes are homoscedastic, that is $\Sigma_0 = \Sigma_1 = \Sigma$, then the decision
can be simplified to testing whether $\omega^T x < c$ where $\omega$ is defined as $\omega =
\Sigma^{-1}(\mu_1 - \mu_0)$. This latter case is known in the literature as Linear Discriminant
Analysis (LDA). When the homoscedasticity assumption is not valid, Fisher proposed to
find a vector $\omega$ that separates the two classes by optimizing the ratio

$$ r(\omega) = \frac{(\omega^T \mu_0 - \omega^T \mu_1)^2}{\omega^T \Sigma_0 \omega + \omega^T \Sigma_1 \omega} = \frac{(\omega^T (\mu_0 - \mu_1))^2}{\omega^T (\Sigma_0 + \Sigma_1) \omega}. $$

The ratio $r$ is a Rayleigh quotient whose maximum is the only nonzero eigenvalue
of the generalized eigenvalue problem $((\mu_0 - \mu_1)(\mu_0 - \mu_1)^T, \Sigma_0 + \Sigma_1)$, which
corresponds to the eigenvector

$$ \omega = (\Sigma_0 + \Sigma_1)^{-1} (\mu_0 - \mu_1). $$
In practice all quantities $\Sigma_i$ and $\mu_i$ are replaced by empirical estimators. As
shown in Mika et al. (1999), the criterion $r$ can be conveniently cast as a quadratic
problem in an arbitrary rkHs $\mathcal{H}$ corresponding to a set $\mathcal{X}$. In this new setting,
a sample $X = (x_1, \dots, x_n)$ of $n$ points in $\mathcal{X}$ is paired with a set of labels
$(y_1, \dots, y_n)$. Instead of looking for a vector $\omega$, namely a linear function, kernel
discriminant analysis looks for a function $f \in \mathcal{H}_n$ that maximizes the ratio

$$ r(f) = \frac{(f(\mu_0) - f(\mu_1))^2}{\mathrm{var}_0 f + \mathrm{var}_1 f}, $$

where $f(\mu_i)$ and $\mathrm{var}_i f$ stand for the empirical mean and variance of $f$ over the
points of class $i$.

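Let us write $n_0$ and $n_1$ for the numbers of elements of $X$ of class 0 and 1
respectively, where $n_0 + n_1 = n$. For functions $f \in \mathcal{H}_n$, namely functions which
can be written as $f(\cdot) = \sum_{i=1}^n a_i k(x_i, \cdot)$, we have that

$$ r(f) = \frac{(a^T m_0 - a^T m_1)^2}{a^T S_0 a + a^T S_1 a}, $$

where writing

$$ K = [k(x_i, x_j)]_{1 \le i, j \le n}, $$
$$ K_0 = [k(x_i, x_j)]_{1 \le i, j \le n,\ y_j = 0}, $$
$$ K_1 = [k(x_i, x_j)]_{1 \le i, j \le n,\ y_j = 1}, $$

allows us to express means and variances of $X$ evaluated in functions of $\mathcal{H}_n$
as

$$ m_0 = \frac{1}{n_0} K_0 1_{n_0}, $$
$$ m_1 = \frac{1}{n_1} K_1 1_{n_1}, $$
$$ S_0 = K_0 \Big(I - \frac{1}{n_0} 1_{n_0, n_0}\Big) K_0^T, $$
$$ S_1 = K_1 \Big(I - \frac{1}{n_1} 1_{n_1, n_1}\Big) K_1^T, $$

with $1_{n_i}$ the vector of $n_i$ ones and $1_{n_i, n_i}$ the $n_i \times n_i$ matrix of ones.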

Following the approach used above for linear functionals in Euclidean spaces, the
vector $a$ of weights could be recovered as the eigenvector associated with the (only)
nonzero eigenvalue of the $n$ dimensional generalized eigenvalue problem
$((m_0 - m_1)(m_0 - m_1)^T, S_0 + S_1)$.
However, as the reader may easily check, the matrix $S_0 + S_1$ is not invertible
in the general case. Adding a regularization term $\lambda I_n$ is a hack that makes the
problem computationally tractable. It can also be motivated from a regularization
point of view. Indeed, since we are looking to maximize the ratio, this
modification is equivalent to adding the rkHs norm of $f_a$ to its denominator
and hence favors functions with low norm. Although this explanation is not as
motivated as the empirical regularization scheme discussed in Section 4.3, it is
the one provided in the original work of Mika et al. (1999). Note in particular
that the representer theorem does not apply in this setting and hence looking for
a function in $\mathcal{H}_n$ is itself an arbitrary choice. The kernel discriminant is thus the
function $f$ such that

$$ f(\cdot) = \sum_{i=1}^n a_i k(x_i, \cdot), \qquad a = (S_0 + S_1 + \lambda I_n)^{-1} (m_0 - m_1). $$
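
As an illustration, the regularized solution $a = (S_0 + S_1 + \lambda I_n)^{-1}(m_0 - m_1)$ can be computed in a few lines. This is only a sketch under our own assumptions: a Gaussian kernel, labels in {0, 1}, class means normalized by $n_0$ and $n_1$, and hypothetical function names.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def kernel_fisher_discriminant(X, y, lam=1e-3, sigma=1.0):
    """Return the weights a of f = sum_i a_i k(x_i, .) and the Gram matrix K."""
    n = len(y)
    K = gaussian_kernel(X, X, sigma)
    means, S = [], np.zeros((n, n))
    for c in (0, 1):
        Kc = K[:, y == c]                      # n x n_c block of columns of class c
        nc = Kc.shape[1]
        means.append(Kc.mean(axis=1))          # m_c = (1 / n_c) K_c 1
        centering = np.eye(nc) - np.ones((nc, nc)) / nc
        S += Kc @ centering @ Kc.T             # S_c = K_c (I - (1/n_c) 11^T) K_c^T
    a = np.linalg.solve(S + lam * np.eye(n), means[0] - means[1])
    return a, K
```

The values of the discriminant on the sample are `K @ a`. With this sign convention the points of class 0 receive, on average, larger values than the points of class 1.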



4.4 Density Estimation and Novelty Detection 

A density estimator is an estimator based on a data sample of points drawn 
independently and identically distributed according to an unobservable under- 
lying probability density function. The level sets of the estimator are the sets 
of points in X for which the density of the estimator has values below or above 
a given threshold. Estimating level sets rather than a density estimator taken 
within a set of candidate densities is the nonparametric direction taken by the 
one-class support vector machine presented below. 



one-class SVM: taking advantage of the support vector machine formulation
to minimize the penalized empirical risk of Equation (15), Schölkopf et al. (1999)
proposed the reformulation

$$ R_c^{\lambda}(f) = \frac{1}{n} \sum_{i=1}^n c(f(x_i)) + \lambda \|f\|^2_{\mathcal{H}}, $$

where the labels of all points are set to 1 to estimate a function $f$ that is
positive on its support and that takes smaller values on areas of lower densities.
Here $c$ can be any convex function differentiable at 0 and such that $c'(0) < 0$. In
particular, Schölkopf et al. (1999) solve the following mathematical program

$$ \begin{array}{ll} \text{minimize} & \nu \|f\|^2_{\mathcal{H}} + \sum_{i=1}^n (\xi_i - \rho) \\ \text{subject to} & f \in \mathcal{H}_n, \\ & f(x_i) \ge \rho - \xi_i, \quad \xi_i \ge 0. \end{array} $$



novelty detection and kernel-PCA: Novelty detection refers to the task
of detecting patterns in a given data set that do not conform to an established
normal behavior (Chandola et al., 2009). Novelty detection can be implemented
in practice by using the level sets of a density estimator. A new observation is
intuitively labelled as abnormal if it lies within a region of low density of the
estimator, granted this new observation has been drawn from the same distribution.
Another approach to novelty detection is given by a spectral analysis
implemented through the study of the principal components of a data sample.
This approach can be naturally generalized to a "kernelized algorithm".

Principal component analysis can be used as a novelty detection tool for
multivariate data (Jolliffe, 2002, §10.1) assuming the underlying data can be
reasonably approximated by a Gaussian distribution. Given a sample $X =
(x_1, \dots, x_n)$ of $n$ points drawn i.i.d. from the distribution of interest in $\mathbb{R}^d$, the
$p$ first eigenvectors of PCA are defined as the $p$ first orthonormal eigenvectors
$e_1, \dots, e_p$ with corresponding eigenvalues $\lambda_1, \dots, \lambda_p$ of the sample variance matrix
$\Sigma_n = \frac{1}{n-1} \sum_{i=1}^n (x_i - m)(x_i - m)^T$ where $m = \frac{1}{n} \sum_{i=1}^n x_i$ is the sample mean.
An observation $y$ is labelled as abnormal whenever its projection in the space
spanned by the $p$ eigenvectors is markedly outside the ellipsoid defined by the
semi-axes $(e_i, \lambda_i)$, namely when

$$ \sum_{i=1}^p \left( \frac{e_i^T (y - m)}{\lambda_i} \right)^2 \gg 1, $$

or alternatively when the contribution of the first $p$ eigenvectors to the total
norm of $y$ is low compared to the weight taken by the other directions of lower
variance,

$$ \frac{\sum_{i=1}^p (e_i^T y)^2}{\|y\|^2} \le \alpha. $$

This idea has been extended to the case of kernel-PCA in (Hoffmann, 2007) by
using a kernel $k$ on $\mathcal{X}$. In that case the linear functionals $e_i^T \cdot$ are replaced by the
evaluations of eigenfunctions $f_i(\cdot)$ introduced in Equation (14), and the norm of
$y$ itself is taken in the corresponding rkHs $\mathcal{H}$, where its square can be recovered as $k(y, y)$.
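
The second criterion can be sketched in its kernelized form as follows. The code is an illustration under our own assumptions, not the exact procedure of Hoffmann (2007): it uses a Gaussian kernel, centers the data in feature space, and reports the fraction of the squared norm of the centered image of $y$ that is captured by the first $p$ kernel principal components. All function names are hypothetical.

```python
import numpy as np

def rbf(A, B, sigma=1.0):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def kpca_captured_fraction(X, Y, p, sigma=1.0):
    """For each row y of Y, fraction of the squared feature-space norm of the
    centered point that lies in the span of the first p kernel principal
    components of the sample X. Requires p <= n - 1."""
    n = X.shape[0]
    K = rbf(X, X, sigma)
    H = np.eye(n) - np.ones((n, n)) / n            # centering matrix
    lam, V = np.linalg.eigh(H @ K @ H)             # eigenvalues in ascending order
    lam, V = lam[::-1][:p], V[:, ::-1][:, :p]      # keep the p largest
    A = V / np.sqrt(lam)                           # coefficients of unit-norm eigenfunctions
    Ky = rbf(Y, X, sigma)
    # Centered kernel evaluations <phi(y) - mu, phi(x_i) - mu>
    Kyc = Ky - Ky.mean(axis=1, keepdims=True) - K.mean(axis=0) + K.mean()
    proj = Kyc @ A                                 # coordinates on the p components
    # Squared norm of phi(y) - mu, using k(y, y) = 1 for the Gaussian kernel
    norm2 = 1.0 - 2.0 * Ky.mean(axis=1) + K.mean()
    return (proj ** 2).sum(axis=1) / norm2
```

An observation whose fraction falls below a chosen threshold $\alpha$ would then be flagged as abnormal.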
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5 Kernel Selection and Kernel Mixture 



An important issue that arises when using kernel machines in practice is to 
select an adequate kernel. In most practical cases choices are abundant if not 
infinite. We review three different families of techniques designed to cope with 
this situation. The following section is set in the usual classification setting 
where the dataset of interest is composed of n pairs of points and labels, i.e. 
$\{(x_i, y_i)\}_{i=1..n}$ where each $x_i \in \mathcal{X}$ and $y_i \in \{-1, 1\}$.



5.1 Parameter Selection 



When the kernel can be parameterized by a few variables, a brute force approach
that examines the cross-validation error over a grid of acceptable parameters
is a reasonable method which often yields satisfactory results. This
approach becomes intractable as soon as the number of parameters exceeds a few
values. (Chapelle et al., 2002; Bousquet and Herrmann, 2003; Fröhlich et al.,
2004) and more recently (Keerthi et al., 2007) have proposed different schemes
to tune the parameters of a Gaussian kernel on $\mathbb{R}^d$. The authors usually assume
a setting where the weights $\sigma_i$ assigned to each feature of two vectors $x$ and $y$
in $\mathbb{R}^d$ need to be tuned, that is consider kernels of the form

$$ k(x, y) = \exp\left( - \sum_{i=1}^d \frac{(x_i - y_i)^2}{\sigma_i^2} \right). $$
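
A kernel of this form, with one bandwidth per feature, is straightforward to evaluate. The sketch below assumes the normalization $\exp(-\sum_i (x_i - y_i)^2 / \sigma_i^2)$; the function name is ours.

```python
import numpy as np

def ard_gaussian_kernel(X, Y, sigmas):
    """Gaussian kernel with one bandwidth sigma_i per feature:
    k(x, y) = exp(-sum_i (x_i - y_i)^2 / sigma_i^2)."""
    s = np.asarray(sigmas, dtype=float)
    diff = (X[:, None, :] - Y[None, :, :]) / s     # rescale each feature difference
    return np.exp(-(diff ** 2).sum(axis=-1))
```

Letting a bandwidth $\sigma_i$ grow very large effectively removes feature $i$ from the comparison, which is why tuning these weights also acts as a form of feature selection.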



Finding an adequate parameter choice $(\sigma_1, \dots, \sigma_d)$ implies defining first a criterion
to discriminate good from bad choices for such parameters. Chapelle et al.
(2002) consider the leave-one-out error of the regularized empirical risk formulation
of Equation (15). Given a kernel $k_\sigma$ parameterized by a set of parameters
$(\sigma_1, \dots, \sigma_d)$, the leave-one-out error is the sum

$$ E_{LOO}(\sigma) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}\big(f^{-i}(x_i) \neq y_i\big), $$

where we use a generic regularized empirical risk minimizer estimated on all
points of the sample but one:

$$ f^{-i} = \underset{f \in \mathcal{H}}{\mathrm{argmin}} \ \frac{1}{n} \sum_{j=1, j \neq i}^n c(f(x_j), y_j) + \lambda J(f). $$



Evgeniou et al. (2004) show that the leave-one-out error is a good way to quantify
the performance of a class of classifiers. If $E_{LOO}$ was a tractable and analytical
function of $\sigma$, it would thus seem reasonable to select $\sigma$ as a minimizer of
$E_{LOO}$. This is not the case unfortunately. The authors of both (Chapelle et al.,
2002; Bousquet and Herrmann, 2003) propose to consider instead upper bounds
on $E_{LOO}$ which are tractable and design algorithms to minimize such upper
bounds through gradient descent methods. Keerthi et al. (2007) generalize this
approach by considering other proxies of the performance of a kernel on a given
problem.
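
For a single bandwidth, the brute force alternative mentioned at the beginning of this section remains feasible. The sketch below evaluates the leave-one-out error on a grid; it is only an illustration, and we swap in a regularized least-squares classifier (the sign of a kernel ridge regression fit) for the generic risk minimizer because it has a closed form. The function names and default parameters are hypothetical.

```python
import numpy as np

def gaussian_gram(A, B, sigma):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def loo_error(X, y, sigma, lam=0.1):
    """Fraction of points misclassified by the classifier trained without them."""
    n = len(y)
    errors = 0
    for i in range(n):
        keep = np.arange(n) != i                   # leave point i out
        K = gaussian_gram(X[keep], X[keep], sigma)
        a = np.linalg.solve(K + lam * np.eye(n - 1), y[keep])
        f_i = gaussian_gram(X[i:i + 1], X[keep], sigma) @ a
        errors += int(np.sign(f_i[0]) != y[i])
    return errors / n

def select_bandwidth(X, y, grid):
    """Return the bandwidth of the grid with the lowest leave-one-out error."""
    scores = {s: loo_error(X, y, s) for s in grid}
    best = min(scores, key=scores.get)
    return best, scores
```

Each grid point costs $n$ model fits, which illustrates why such a search does not scale to more than a handful of parameters.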
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5.2 Multiple Kernel Learning 



Rather than looking for a single kernel in a large set of candidates, a research
trend initiated by Lanckriet et al. (2004) proposes to consider instead
combinations of candidate kernels. As recalled in Section 2.4, positive definite
kernels can be combined multiplicatively (under point-wise multiplication)
and linearly (through positive linear combinations). Since the pioneering
work of Lanckriet et al. (2004), which relied on expensive semi-definite programming
to compute optimal linear combinations of kernels, the focus of study
has progressively shifted towards computationally efficient alternatives to define
useful additive mixtures as in (Bach et al., 2004; Sonnenburg et al., 2006;
Rakotomamonjy et al., 2007). A theoretical foundation for this line of research
can be found in Micchelli and Pontil (2006). We follow the exposition used
in (Rakotomamonjy et al., 2007). Recall, as exposed in Section 4, that given
a kernel $k$, kernel classifiers or regressors yield decision functions of the form

$$ f(x) = \sum_{i=1}^n \alpha_i^* y_i k(x, x_i) + b^*, \tag{18} $$



where both the family $(\alpha_i^*)$ and $b^*$ stand for optimized parameters. When
not one, but a family of $m$ kernels $k_1, \dots, k_m$ can be combined in a
convex manner to yield a composite kernel $k = \sum_{l=1}^m d_l k_l$ with $\sum_{l=1}^m d_l = 1$, the
task consisting in learning both the coefficients $\alpha_i$, $b$ and the weights $d_l$ in a
single optimization problem is known as the multiple kernel learning (MKL)
problem (Bach et al., 2004). Writing $\mathcal{H}$ for the rkHs corresponding to kernel
$k$, the penalized SVM-type optimization framework for the estimation of a
function $f$ in $\mathcal{H}$ is

$$ \begin{array}{ll} \text{minimize} & \|f\|^2_{\mathcal{H}} + C \sum_{i=1}^n \xi_i \\ \text{subject to} & f \in \mathcal{H},\ b \in \mathbb{R}, \\ & \forall i,\ y_i (f(x_i) + b) \ge 1 - \xi_i, \\ & \xi_i \ge 0. \end{array} $$



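When the kernel $k$ is a mixture of $m$ kernels, the authors propose the following
optimization scheme,

$$ \begin{array}{ll} \text{minimize} & \sum_{l=1}^m \frac{1}{d_l} \|f_l\|^2_{\mathcal{H}_l} + C \sum_{i=1}^n \xi_i \\ \text{subject to} & f_l \in \mathcal{H}_l,\ b \in \mathbb{R},\ d_l,\ \xi_i, \\ & \forall i \le n,\ y_i \big( \sum_{l} f_l(x_i) + b \big) \ge 1 - \xi_i, \\ & \sum_{l} d_l = 1;\ \xi_i, d_l \ge 0. \end{array} $$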



In the formulation above, each value $d_l$ controls the importance given to the squared
norm of $f_l$ in the objective function. A bigger $d_l$ favors functions whose component
in $\mathcal{H}_l$ may have a larger norm. If the weight $d_l$ goes to zero, the corresponding
function $f_l$ can only be zero as shown by the authors, which is indeed
equivalent to not taking into account kernel $k_l$ in the sum $\sum_l d_l k_l$. The solution
to the problem above can actually be decomposed into a two-step procedure,
namely by minimizing an objective function $J(d)$ defined on the weights $d = (d_l)$
and which is itself computed through an SVM optimization, namely:

$$ \begin{array}{ll} \text{minimize} & J(d), \\ \text{subject to} & \sum_l d_l = 1;\ d_l \ge 0, \end{array} $$

$$ J(d) = \left\{ \begin{array}{ll} \text{minimize} & \sum_{l} \frac{1}{d_l} \|f_l\|^2_{\mathcal{H}_l} + C \sum_i \xi_i \\ \text{subject to} & f_l \in \mathcal{H}_l,\ b \in \mathbb{R},\ \xi_i, \\ & \forall i,\ y_i \big( \sum_l f_l(x_i) + b \big) \ge 1 - \xi_i;\ \xi_i \ge 0. \end{array} \right. \tag{19} $$

The authors iterate between the computation of the objective function $J$, itself
an SVM optimization, and the optimization of $J$, carried out using projected
gradient methods. Each iteration of this loop involves the computation of the
gradient's directions $\partial J / \partial d_l$, which the authors show are simple functions of the
weights $\alpha_i^*$ retrieved during the SVM computation conducted to compute $J$,
namely, up to a positive constant fixed by the scaling of the objective,

$$ \frac{\partial J}{\partial d_l} \propto - \sum_{i,j=1}^n \alpha_i^* \alpha_j^* y_i y_j k_l(x_i, x_j). $$

The algorithm boils down to the following loop. 

• Initialize all weights $d_l$ to $1/m$.

• Loop:

    - Compute an SVM solution to the problem with fixed weights $d$. This
      gives $J$, as well as its associated gradient directions $\partial J / \partial d_l$.

    - Optimize $J$ with respect to $d$, that is replace the current weights
      family $d$ by $d + \gamma D$, where $D$ is the vector of descent direction computed
      from the gradient (reducing it and projecting it) such that the
      new $d$ satisfies the simplex constraints, and $\gamma$ is an optimal step size
      determined by line search.

    - Check for optimality conditions initially set, and if reached get out
      of the loop.

By the end of the convergence, both the weights $\alpha$ and $b$ that arise from the
last computation of $J$, that is the SVM-computation step, and the weights $d$
obtained in the end provide the parameters needed to define $f$ in Equation (18).
An additional property of the algorithm is that it tends to produce sparse patterns
for $d$, which can be helpful to interpret which kernels are the most
useful for the given task.
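
The alternating structure of this loop can be sketched in a simplified setting. To keep the example self-contained we make two substitutions, which are ours and not those of the authors: the inner SVM is replaced by kernel ridge regression, for which $J(d) = \lambda\, y^T (K_d + \lambda I)^{-1} y$ and $\partial J / \partial d_l = -\lambda\, \alpha^T K_l \alpha$ with $\alpha = (K_d + \lambda I)^{-1} y$, and the reduced gradient with line search is replaced by a Euclidean projection onto the simplex with step halving. All names are hypothetical.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto {d : d >= 0, sum(d) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / idx > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

def inner_solution(Ks, d, y, lam):
    """Solve the inner problem for fixed weights d; return J, its gradient, alpha."""
    K = sum(w * Kl for w, Kl in zip(d, Ks))
    alpha = np.linalg.solve(K + lam * np.eye(len(y)), y)
    J = lam * float(y @ alpha)
    grad = np.array([-lam * float(alpha @ Kl @ alpha) for Kl in Ks])
    return J, grad, alpha

def mkl_ridge(Ks, y, lam=0.1, n_iter=50, tol=1e-10):
    """Alternate between the inner solver and a projected gradient step on d."""
    d = np.full(len(Ks), 1.0 / len(Ks))            # uniform initial weights
    J, grad, alpha = inner_solution(Ks, d, y, lam)
    history = [J]
    for _ in range(n_iter):
        step, improved = 1.0, False
        while step > 1e-10:                        # halve the step until J decreases
            d_try = project_simplex(d - step * grad)
            J_try, grad_try, alpha_try = inner_solution(Ks, d_try, y, lam)
            if J_try < J - tol:
                d, J, grad, alpha = d_try, J_try, grad_try, alpha_try
                improved = True
                break
            step *= 0.5
        if not improved:                           # no descent step found: stop
            break
        history.append(J)
    return d, alpha, history
```

As in the SVM case, the gradient with respect to each weight is a quadratic form of the current dual coefficients in the corresponding Gram matrix, so each outer iteration costs little more than one inner fit.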

5.3 Families of Kernels Labeled on a Graph 

The approach taken in the previous section assumes that all kernels are independently
selected. The optimization of Equation (19) is carried out on the linear
subspace formed by all linear combinations of these $m$ kernels. Rather than
treating all kernels uniformly and mixing them linearly, (Cuturi and Fukumizu,
2007) consider a setting with two particular features.

First, an a-priori knowledge on the structure of the kernels themselves can
be used, namely a hierarchical structure under the form of a tree. Kernels are
indexed by labels on a directed acyclic graph (DAG) $\{\alpha \in T\}$, and each kernel
$k_\alpha$ is related to its sons, that is $k_\beta, \beta \in s(\alpha)$ where $s(\alpha)$ stands for the sons
of a node $\alpha$. An example of such an approach can be seen in Figure 4 where
the hierarchy is a dyadic partition of the surface of an image.

Figure 4: The leftmost image represents the final nodes of the hierarchy considered, that is the
$4^3$ squares present in the grid. The ancestors of such nodes correspond iteratively to the 16 larger
squares obtained when grouping 4 small windows, then to the image divided into 4 equal parts and
finally the whole image. The hierarchy has thus a depth of 3. Any set of nodes taken in the hierarchy
can in turn be used to compare two images under the light of those local color histograms displayed
in the rightmost image, which reduce in the case of two-color images to binary histograms.



Second, the hierarchy can be used not only to combine kernels additively but
also multiplicatively. More precisely, the authors define the space of candidate
kernels as the space $S$ of all complete subtrees $t$ of $T$ starting with the same
root. Such a tree $t$ is uniquely characterized by its set of final nodes $f(t)$, and
the kernel associated to such a subtree is the product of the kernels associated
to each final node, that is

$$ k_t = \prod_{\alpha \in f(t)} k_\alpha. $$

Note that the number of potential subtrees grows super-exponentially, hence
yielding a number of candidate kernels far superior to the total number of node
kernels.



Grounded on these two assumptions, Cuturi and Fukumizu (2007) use a
prior weight on subtree kernels to propose a fast computation of a kernel $k$
as

$$ k = \sum_{t \in S} d_t k_t. $$

The weight $d_t$ penalizes the complexity of a given subtree $t$ by considering
its number of nodes. In practice the weights $d_t$ are defined with an analogy
to branching process priors for trees (Jagers, 1975). Bach (2008a) proposed
a similar setting to optimize directly the weights $d_t$ using a variation of the
Multiple Kernel Learning framework. The originality of the approach is to take
advantage of the hierarchy between kernels to adaptively explore subtrees which
may fit better the task at stake.
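
To see why such a sum over a super-exponential number of subtrees can be computed quickly, consider the following sketch. It assumes a hypothetical prior, chosen by us for the illustration and not necessarily the one of Cuturi and Fukumizu (2007): each internal node is expanded into all its sons with probability $p$ and kept as a final node with probability $1 - p$. Under this assumption the weighted sum factorizes into a recursion over the tree, which we check against a brute force enumeration. A node is represented as a pair `(value, children)` where `value` stands for $k_\alpha(x, y)$ at a fixed pair of objects.

```python
from itertools import product
from math import prod

def subtree_kernel(node, p=0.5):
    """Sum over complete subtrees t of d_t * k_t, computed recursively."""
    value, children = node
    if not children:                       # a leaf of T is always a final node
        return value
    expanded = prod(subtree_kernel(c, p) for c in children)
    return (1.0 - p) * value + p * expanded

def enumerate_subtrees(node, p=0.5):
    """Brute force list of (d_t, k_t) over all complete subtrees rooted at node."""
    value, children = node
    if not children:
        return [(1.0, value)]
    out = [(1.0 - p, value)]               # stop here: the node itself is final
    for combo in product(*(enumerate_subtrees(c, p) for c in children)):
        weight = p * prod(w for w, _ in combo)
        kernel = prod(k for _, k in combo)
        out.append((weight, kernel))
    return out
```

The recursion visits each node once, whereas the enumeration grows with the number of complete subtrees.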
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6 Kernel Cookbook 



We review in this section practical guidelines that can apply to the selection of
a kernel given a dataset of structured objects.

6.1 Advanced Kernels on Vectors 

Vectors of $\mathbb{R}^d$ can be directly approached by considering the linear dot-product
as a kernel, which amounts to performing an alternative penalized regression
and optimizing it in the dual space as is described in Chapelle (2007).

Beyond the use of the linear kernel, the array of positive definite kernels
defined on vectors, if not on scalars, is very large, and includes functions of all
possible shapes as illustrated in (Berg et al., 1984, Exercise 2.12, p. 79). Although
some specific kernels have been used for their precise invariance properties
(Franc and Sahbi, 2003), most practitioners limit themselves to the use
of Gaussian and polynomial kernels. Once a family of kernels has been selected,
the topic of choosing adequate parameters for this kernel is itself one
of the biggest challenges when using kernel methods on vectorial data, as hinted
in (Hastie et al., 2001, Section 12.3.4). For polynomial kernels searches are usually
limited to the offset and exponent parameters. In the case of Gaussian
kernels, usually favored by practitioners, the more general use of Mahalanobis
distances instead of the simple Euclidean distance, that is kernels of the form

$$ k_\Sigma(x, y) = \exp\left( -(x - y)^T \Sigma^{-1} (x - y) \right), $$

where $\Sigma$ is a $d \times d$ symmetric positive definite matrix, has also been investigated
to fit better data at hand and to insist on the possible correlations or importance
of the described features. The simplest type of matrices $\Sigma$ which can be used is
one with a diagonal structure, and pre-whitening the data might be considered
as such an approach. More advanced tuning strategies have been covered in
Section 5.1.



6.2 Kernels on Graphs 

Labeled graphs are widely used in computer science to model data linked by 
discrete dependencies, such as social networks, molecular pathways or patches in 
images. Designing kernels for graphs is usually done with this wide applicability 
in mind. 

A graph $G$ is described by a finite set of vertices $V$ and a hence finite set of
edges $E \subset V \times V$. Graphs are sometimes labelled. In that case there exists a
function from $E$ to the set of labels $\mathcal{L}$, or alternatively from $V$ to $\mathcal{L}$, that assigns a label
to a node or an edge.

Given an arbitrary subgraph $f$ and a graph of interest $G$, the feature $f(G)$
measuring how many subgraphs of $G$ have the same structure as graph $f$ is a useful
elementary feature. The original paper by Kashima et al. (2003) presented in
Section 3.2 uses for the set of subgraphs $f$ simple random walks and counts their
co-occurrences to provide a kernel, an approach that had also been studied in the
case of trees (Vert, 2002). The work has found extensions in (Mahé et al., 2005)
to take better into account similarity between not only the graph structure but
also the labels that populate it. More advanced sets of features, which rely on
algebraic descriptions of graphs, have been recently considered in (Kondor et al.,
2009). We refer the reader to the exhaustive review of Vishwanathan et al.
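
As a simple illustration of counting co-occurring walks, the sketch below implements a basic walk-counting kernel for unlabeled graphs. It is a simplified variant chosen by us, not the marginalized kernel of Kashima et al. (2003): pairs of walks of equal length in the two graphs are counted through the adjacency matrix of their direct product graph and discounted geometrically with their length. The names `decay` and `max_len` are hypothetical parameters.

```python
import numpy as np

def walk_kernel(A1, A2, decay=0.5, max_len=3):
    """Sum over l = 0..max_len of decay^l times the number of pairs of walks
    of length l, one taken in each graph (unlabeled case)."""
    Ax = np.kron(A1, A2)                   # adjacency of the direct product graph
    ones = np.ones(Ax.shape[0])
    total, vec = 0.0, ones.copy()
    for l in range(max_len + 1):
        total += (decay ** l) * float(ones @ vec)   # walks of length l in the product
        vec = Ax @ vec
    return total
```

In the unlabeled case the count of walks of length $l$ in the product graph is the product of the counts in each graph, so the kernel is a dot-product between vectors of discounted walk counts.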



6.3 Kernels on Images 

Technically speaking, an image can be seen as a long vector of RGB intensities,
with three values per pixel. It is however unlikely that treating images as vectors and
applying Gaussian kernels on them will yield any interesting result. In that
sense, the definition of kernels for images is built on higher-level properties
of images and image contents, such as the invariance to slight translations in
both color intensities and pattern positions in the image. These properties can
be translated into the following kernels.



color histograms: numerous approaches have stemmed from the use of color
histograms to build kernels on images, starting with the seminal experiments
carried out by Chapelle et al. (1999). By representing an image $I$ by an arbitrary
color histogram $\theta_I \in \Sigma_d$, where $d$ stands for the color depth (typically 256, 4096
or 16 million), the authors follow by designing a kernel on two images using
kernels on multinomials such as those presented in Section 3.1.2.

Note that this approach assumes a total invariance under pixel translation,
which is usually a drastic loss of information on the structure and the content
of the image itself, as illustrated in Figure 5.

Further developments have tried to cope with this limitation. Rather
than considering a single histogram for each image, Grauman and Darrell (2005)
and Cuturi and Fukumizu (2007) divide the image into local patches and compare
the resulting families of local histograms. These approaches provide substantial
improvements at a low computational cost.
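
The invariance discussed above is easy to observe on a toy example. The sketch below builds a coarse RGB histogram and compares two histograms with a Laplacian-type kernel $\exp(-\rho \|\theta_1 - \theta_2\|_1)$; this particular kernel, the number of bins and the function names are assumptions made for the illustration.

```python
import math
from collections import Counter

def color_histogram(pixels, bins=4):
    """Normalized histogram of (r, g, b) pixels with values in 0..255,
    each channel being quantized into `bins` levels."""
    width = 256 // bins
    counts = Counter((r // width, g // width, b // width) for r, g, b in pixels)
    n = len(pixels)
    return {color: c / n for color, c in counts.items()}

def histogram_kernel(h1, h2, rho=1.0):
    """Laplacian-type kernel exp(-rho * ||h1 - h2||_1) between two histograms."""
    keys = set(h1) | set(h2)
    l1 = sum(abs(h1.get(k, 0.0) - h2.get(k, 0.0)) for k in keys)
    return math.exp(-rho * l1)
```

Any rearrangement of the pixels of an image leaves its histogram unchanged, so the kernel cannot distinguish an image from a scrambled copy of itself.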



taking shapes into account: note first that by counting elementary
shapes in each image, rather than colours, the techniques described in the paragraph
above can be applied to define kernels that focus on shape similarities.
However, when larger shapes are the main criterion to discriminate two images,
histogram representations have obvious limitations. Haasdonk and Keysers (2002)
propose a kernel which exponentiates the opposite of the tangent distance between
two images to quantify their similarity. Since the computation of the tangent
distance requires the optimization of a criterion, the framework is related
to other attempts at designing a kernel from a distance, e.g. (Watkins, 2000;
Shimodaira et al., 2002a; Vert et al., 2004; Cuturi, 2007; Cuturi et al., 2007).
The tangent distance (Simard et al., 1998) is a distance computed between two
shapes $x$ and $y$ to assess how different they are from each other by finding an optimal
series of elementary transformations (rotations, translations) that produces
$y$ when starting from $x$. Since the distance is not negative definite, the tangent
distance does not yield directly a positive definite kernel, but might be used in
practice with most kernel machines after an adequate correction.

Figure 5: A complex image such as the monkey above can be summarized through color histograms,
represented, as above, as a 3D histogram of red/green/blue intensities. Although this
representation incurs a considerable loss of information, it is often used for image retrieval.



shapes seen as graphs: taking such limitations into account but still willing
to incorporate a discrimination based on shapes, Harchaoui and Bach (2007)
have exploited existing graph kernels to adapt them to images. Images can indeed
be seen as large graphs of interconnected color dots. Harchaoui and Bach
(2007) propose to segment first the images through standard techniques (in the
quoted paper the authors use the watershed transform technique) into large
areas of homogeneous colors, and then treat the resulting interconnections between
colored areas as smaller graphs labeled with those simplified colors. The
two graphs are subsequently compared using standard graph kernels, notably
a variation proposed by the authors. When the number of active points in the
images is low, Bach (2008a) focuses on a specific category of graph kernels tailored
for point clouds taking values in 2D or 3D spaces.



6.4 Kernels on Variable-Length Sequential Data 

Variable-length sequence data-types are ubiquitous in most machine learning
applications. They include the observations sampled from discrete-time processes,
texts as well as long strings such as protein and DNA codes. One of
the challenges of designing kernels on such objects is that such kernels should
be able to compare sequences of different lengths, as would be the case when
comparing two speech segments with different sampling frequencies or overall
recorded time, two texts, or two proteins with different total numbers of amino
acids.



kernels on texts: most kernels used in practice on texts stem from the use
of the popular bag-of-words (BoW) representations, that is sparse word count
vectors taken against very large dictionaries. The monograph (Joachims, 2002)
shows how the variations of the BoW can be used in conjunction with simple
kernels such as the ones presented in Section 3.1.2. From a methodological
point of view, much of the effort goes into choosing efficient BoW
representations, while the kernels themselves usually boil down to simple
ones.



histograms of transitions: when tokens are discrete and few, the easiest approach
is arguably to map sequences to histograms of shorter substrings, also known
as n-grams, and compare those histograms directly. This approach was initially
proposed by (Leslie et al., 2002) with subsequent refinements to either incorporate
more knowledge about the tokens transitions (Cuturi and Vert, 2005;
Leslie et al., 2003) or improve computational speed (Teo and Vishwanathan,
2006).
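
In its simplest form, comparing n-gram histograms directly amounts to a dot-product between vectors of n-gram counts. The sketch below is a minimal version of this idea; the function names are ours and none of the refinements cited above are included.

```python
from collections import Counter

def ngram_counts(s, n):
    """Histogram of the contiguous substrings of length n found in s."""
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def spectrum_kernel(s, t, n=3):
    """Dot-product between the n-gram count vectors of s and t."""
    cs, ct = ngram_counts(s, n), ngram_counts(t, n)
    return sum(count * ct[gram] for gram, count in cs.items())
```

Since only the n-grams actually present in the two sequences are enumerated, sequences of different lengths are compared without any alignment.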



higher level transition modeling with HMM's: rather than using simple
n-gram count descriptors, Jaakkola et al. (2000) use more elaborate statistical
models to define kernels between strings which can be modelled as HMMs. The
interested reader may refer to Section 3.2 for a review of the Fisher kernel to
see how the HMM model is used to build a feature vector to compare strings
directly.



edit distances: a different class of kernels can be built using transformations
on the sequences themselves, in a form that echoes the tangent distance
kernel presented in an earlier section. Intuitively, if by successive and minor
changes one can map a sequence $x$ to another sequence $y$, then the overall cost
(which remains to be defined) needed to go from $x$ to $y$ can be seen as a good
indicator of how related they are to each other. As with the tangent distance
reviewed above, Shimodaira et al. (2002b) take into account the optimal route
from $x$ to $y$, whose total cost is known depending on the application field as
the edit distance, the Smith-Waterman score, Dynamic Time Warping or Levenshtein
distance, to define a kernel which is not necessarily positive definite
but which performs reasonably well on a speech discrimination task. Vert et al.
(2004) argue that by taking a weighted average of the costs associated to all
possible transformations mapping $x$ to $y$, one can obtain a kernel which is positive
definite and which usually performs better on the set of proteins they
consider in their study. Up to a few subtleties, a similar approach is presented
in (Cuturi et al., 2007) which shows good performance on speech data.
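
The optimal-route idea can be illustrated with the Levenshtein distance, computed by dynamic programming with unit costs, and turned into a similarity by exponentiating its opposite. This is a sketch of the general principle rather than of any of the cited methods; the parameter `gamma` is our own, and the resulting similarity is not guaranteed to be positive definite.

```python
import math

def levenshtein(x, y):
    """Minimal number of insertions, deletions and substitutions mapping x to y."""
    prev = list(range(len(y) + 1))         # distances from the empty prefix of x
    for i, cx in enumerate(x, start=1):
        curr = [i]
        for j, cy in enumerate(y, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (cx != cy))) # substitution or match
        prev = curr
    return prev[-1]

def edit_distance_similarity(x, y, gamma=0.5):
    """Similarity exp(-gamma * d(x, y)); not necessarily positive definite."""
    return math.exp(-gamma * levenshtein(x, y))
```

Only the cost of the single best route enters this score, which is precisely what the positive definite alternatives discussed above replace by a weighted sum over all routes.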
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